Review
Semantic SLAM: A comprehensive survey of methods and applications

https://doi.org/10.1016/j.iswa.2025.200591
Under a Creative Commons license
Open access

Highlights

  • Analyzes semantic SLAM’s role in enhancing localization and mapping in dynamic environments.
  • Highlights deep learning advances in semantic SLAM, object recognition, and scene understanding.
  • A review of key datasets, highlighting their role in evaluating and benchmarking semantic SLAM.
  • Reviews advanced monocular, stereo, and RGB-D semantic SLAM methods and assesses their impact.
  • Key research gaps include limited work in dynamic settings and the need for better scalability, speed, and reliability.

Abstract

This paper surveys the different approaches in semantic Simultaneous Localization and Mapping (SLAM), exploring how the incorporation of semantic information has enhanced performance in both indoor and outdoor settings, while highlighting key advancements in the field. It also identifies existing gaps and proposes potential directions for future improvements to address these issues. We provide a detailed review of the fundamentals of semantic SLAM, illustrating how incorporating semantic data enhances scene understanding and mapping accuracy. The paper presents semantic SLAM methods and core techniques that contribute to improved robustness and precision in mapping. A comprehensive overview of commonly used datasets for evaluating semantic SLAM systems is provided, along with a discussion of performance metrics used to assess their efficiency and accuracy. To demonstrate the reliability of semantic SLAM methodologies, we reproduce selected results from existing studies, offering insights into the reproducibility of these approaches. The paper also addresses key challenges such as real-time processing, dynamic scene adaptation, and scalability, while highlighting future research directions. Unlike prior surveys, this paper uniquely combines (i) a systematic taxonomy of semantic SLAM approaches across different sensing modalities and environments, (ii) a comparative review of datasets and evaluation metrics, and (iii) a reproducibility study of selected methods. To our knowledge, this is the first survey that integrates methods, datasets, evaluation practices, and application insights into a single comprehensive review, thereby offering a unified reference for researchers and practitioners. In conclusion, this review underscores the vital role of semantic SLAM in driving advancements in autonomous systems and intelligent navigation by analyzing recent developments, validating findings, and highlighting future research directions.

Keywords

Semantic SLAM
Object-level SLAM
Visual SLAM
Semantic mapping
Dynamic environments
Autonomous systems

1. Introduction

Simultaneous Localization and Mapping (SLAM) is a technique that allows a mobile robot or device to create a map of its environment while simultaneously determining its position within that map (Azzam et al., 2020, Cadena et al., 2016). This dual capability is essential for autonomous systems to function effectively in unknown static or dynamic environments. In robotics, SLAM provides the essential ability for robots to navigate and interact with their surroundings without predefined maps. This capability is crucial for applications ranging from autonomous vehicles (Takleh, Bakar, Rahman, Hamzah, & Aziz, 2018) to indoor mobile robots (Fang et al., 2021, Tian et al., 2025, Yousif et al., 2015) and drones (Antonini et al., 2020, Fu et al., 2023, Motlagh et al., 2019). In autonomous navigation, SLAM allows vehicles to move safely and efficiently through complex and changing environments, such as urban streets or warehouses, by continuously updating their maps.
The primary goal of SLAM is to solve the coupled problem of mapping an environment and localizing the robot within that environment. This involves two main tasks: localization, which determines the position and orientation of the device, and mapping, which builds a coherent map of the environment based on sensory inputs. These tasks must be performed simultaneously because the accuracy of the map depends on the precise location of the device, and vice versa. This interdependence makes SLAM a challenging problem in robotics and computer vision.
Traditional SLAM methods are typically categorized into two main types: direct methods and indirect (feature-based) methods. Direct methods operate directly on raw image or depth data, such as pixel intensities or geometric measurements. These approaches estimate camera motion by minimizing the photometric or geometric error between consecutive frames, often using dense or semi-dense regions of the image. This makes them effective in environments with little texture, though they tend to be more sensitive to lighting variations. Prominent examples include LSD-SLAM (Engel, Schöps, & Cremers, 2014) and DSO (Engel, Koltun, & Cremers, 2017). In contrast, indirect methods, also called feature-based methods, first extract salient features from the environment, such as corners or blobs, using detectors like ORB, SIFT, or SURF. These features are matched across frames to estimate motion and reconstruct the environment. To refine both camera poses and 3D feature positions, many indirect SLAM systems employ bundle adjustment (Wang, Ma, Ren and Lu, 2021), an optimization technique that minimizes the overall reprojection error across multiple keyframes. Indirect methods are generally more robust to illumination changes and partial occlusions but may perform poorly in low-texture or repetitive environments. Notable examples include Parallel Tracking and Mapping (PTAM) (Klein & Murray, 2007) and ORB-SLAM (Mur-Artal, Montiel, & Tardos, 2015). Therefore, the foundation of many SLAM systems lies in the distinction between working with raw image data and tracking discrete visual features. Semantic SLAM builds upon both types by incorporating higher-level understanding, such as object labels and contextual information, which enables more accurate and intelligent mapping in complex, dynamic environments.
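The reprojection error that bundle adjustment minimizes in indirect methods can be made concrete with a short sketch. This is a minimal illustration with a made-up pinhole camera and an identity pose, not code from any cited system:

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D world point X into pixel coordinates using intrinsics K and pose (R, t)."""
    x_cam = R @ X + t          # world frame -> camera frame
    x_img = K @ x_cam          # camera frame -> homogeneous pixel coordinates
    return x_img[:2] / x_img[2]

def reprojection_error(K, R, t, X, observed_uv):
    """Pixel-space residual that bundle adjustment minimizes over poses and 3D points."""
    return np.linalg.norm(project(K, R, t, X) - observed_uv)

# Toy setup: focal length 500 px, principal point (320, 240), identity pose.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
X = np.array([0.2, -0.1, 2.0])                                   # point in front of the camera
uv = project(K, R, t, X)                                         # noise-free observation
err = reprojection_error(K, R, t, X, uv + np.array([1.0, 0.0]))  # 1 px horizontal offset
```

In a full bundle adjustment, this residual is summed over all point–keyframe observations and minimized jointly over camera poses and 3D point positions, typically with a nonlinear least-squares solver.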
Traditional SLAM techniques primarily focus on the geometric and characteristic aspects of the environment. However, semantic SLAM introduces a significant advancement by incorporating high-level semantic information, allowing a better understanding of the environment in the scene, as shown in Fig. 1. The inclusion of semantic information allows for enhanced mapping by recognizing and labeling objects in the environment, creating maps that include not just the geometry but also the identity and function of different elements, which is more useful for tasks that require contextual awareness. Improved localization is achieved as semantic information helps in more robust data association and loop closure detection, reducing drift, and improving the accuracy of the SLAM solution. Semantic SLAM enables robots to perform more complex tasks that require an understanding of the environment, such as object manipulation, human–robot interaction, and autonomous driving in complex urban settings. It enhances performance in dynamic environments by effectively distinguishing between static and moving objects, enabling more adaptive, resilient navigation, and accurate mapping (Chen, Liu et al., 2022, Li, Ye et al., 2025, Qian et al., 2021, Ran et al., 2022).
The origins of SLAM can be traced back to the robotics and computer vision communities in the late 1980s and early 1990s. Initial approaches were primarily focused on probabilistic methods for basic localization and mapping. A seminal paper by Smith and Cheeseman in 1986 introduced the foundational concept of using probability theory to resolve uncertainties in robot navigation, which later became integral to SLAM (Smith & Cheeseman, 1986). The real turning point in SLAM research occurred in the late 1990s and early 2000s with the introduction of the EKF SLAM (Dissanayake, Newman, Clark, Durrant-Whyte, & Csorba, 2001). This method significantly improved the accuracy and reliability of the SLAM algorithms. As computational capabilities expanded, the mid-2000s saw the development of FastSLAM, a particle filter-based approach that addressed some of the scalability issues of EKF SLAM. FastSLAM provided a more efficient way to handle the SLAM problem by separating the mapping and the localization problems (Montemerlo, Thrun, Koller, & Wegbreit, 2002). A comprehensive tutorial on the state of SLAM detailed its challenges and emphasized the role of EKF in advancing the field (Durrant-Whyte & Bailey, 2006).

Fig. 1. Hierarchical structure of semantic SLAM systems.

The introduction of Visual SLAM (vSLAM) marked a significant evolution in the field, replacing expensive laser sensors with cameras to provide richer environmental data at lower cost (Davison, 2003). This camera-based approach gained further momentum with PTAM system, which demonstrated that a single small camera could effectively perform simultaneous tracking and mapping, making SLAM technology more accessible and versatile (Klein & Murray, 2007).
Traditional SLAM systems create geometric maps but lack understanding of object identities and scene semantics. Semantic SLAM addresses this limitation by combining spatial mapping with object recognition, enabling robots to build meaningful environmental representations. A foundational contribution was the introduction of probabilistic data association methods that reliably link semantic observations across multiple viewpoints (Bowman, Atanasov, Daniilidis, & Pappas, 2017). This work demonstrated how to effectively combine semantic labels with SLAM, providing the groundwork for future developments. SemanticFusion combined dense 3D mapping with convolutional neural networks to label and understand the environment dynamically. This approach showed that real-time semantic mapping was feasible and practical (McCormac, Handa, Davison, & Leutenegger, 2017). The evolution from CNN-based labeling to sophisticated inference methods, such as dynamic dense CRFs that maintain semantic consistency across temporal sequences, demonstrates the field’s progression toward robust semantic understanding (You, Luo, Zhou, & Zhu, 2023).
The field of SLAM continues to evolve with advancements in deep learning and artificial intelligence. Semantic SLAM is increasingly focused on enhancing the interaction between autonomous systems and their environments (Chen, Wei, Lin and Lin, 2025), moving towards adaptive, intelligent systems. The detailed timeline of the evolution of SLAM and introduction of semantic SLAM can be visualized in Fig. 2. The integration of semantics into SLAM represents a pivotal advancement in robotics, offering the potential for more intuitive and intelligent machine behavior and significantly expanding the applications of robotic systems in complex environments.
Recently, interest in semantic SLAM and scene understanding has grown significantly, as evidenced by increasing publication numbers. Fig. 3 presents research trends from 2015 to 2025 for semantic SLAM, indoor scene understanding, and outdoor scene understanding, based on articles, proceedings, and book chapters from artificial intelligence, robotics, and engineering fields. The data reveals steady growth in semantic SLAM and indoor scene understanding research, while outdoor scene understanding shows particularly rapid growth in recent years, with continued expansion expected. Fig. 4 illustrates the distribution of semantic SLAM research across domains: robotics leads with 25.6% of publications, followed by automation (24.4%), communication (22.8%), and engineering (19.2%), while agriculture represents only 0.4% of the research output.
These trends reflect the growing interest and advancements in integrating semantic information and scene understanding in complex environments. Although significant progress has been made in geometric and feature-based SLAM from 2015 to 2025, research specifically dedicated to semantic SLAM remains limited. This presents a valuable opportunity for further exploration and development in this emerging field. Additionally, there is a limited number of review papers on semantic SLAM, highlighting the need for more comprehensive surveys and studies. Taking into account these gaps, the major contributions of this survey paper are listed below:

Fig. 2. History and evolution of SLAM over the years from 1986 to 2025.


Fig. 3. Illustration of the number of papers published between 2015 to 2025 in the areas of semantic SLAM, indoor scene understanding, and outdoor scene understanding, highlighting the evolving research focus and growing interest in these fields over time.

Digital Science (2024).

Fig. 4. Pie chart illustrating the distribution of research areas in semantic SLAM, where 24.4% of the research is in automation and 25.6% in remote sensing.

  • Comprehensive coverage: We provide the most up-to-date and systematic review of semantic SLAM methods, filling the gap left by earlier surveys that focused only on geometric SLAM or isolated aspects of semantics.
  • Dataset and metrics analysis: Unlike prior reviews, we analyze commonly used datasets and performance metrics in semantic SLAM, highlighting their strengths, limitations, and suitability for different scenarios.
  • Taxonomy and categorization: We propose a structured taxonomy that organizes semantic SLAM approaches by sensor type, environmental context, and object representation, offering a clearer perspective than existing fragmented overviews.
  • Reproducibility and benchmarking: We reproduce selected results from prior studies, which have not been addressed in previous surveys, to provide insights into reproducibility and reliability of semantic SLAM methods.
  • Future outlook: We identify unresolved challenges, such as scalability, robustness in dynamic environments, and real-time semantic integration, and suggest research directions that go beyond the scope of earlier surveys.
The remainder of this paper is organized as follows (see Fig. 5): Section 2 describes the survey methodology. Section 3 presents the fundamentals of semantic SLAM, covering sensor types, environments, and approaches to 3D object representation. Section 4 reviews commonly used datasets for SLAM evaluation. Section 5 presents advancements in semantic SLAM methods. Section 6 highlights various applications of semantic SLAM, while Section 7 discusses the practical challenges of semantic SLAM. Section 8 presents performance metrics for qualitative and quantitative assessment, including reproducing results from open-source implementations. Section 9 explores future research directions, and Section 10 presents our conclusions.

Fig. 5. Schematic diagram of the overall paper structure discussing various sections and their sub-sections.

2. Survey methodology

Semantic SLAM has emerged as a pivotal area of research, yet several key challenges and gaps remain unaddressed in the current literature. While numerous surveys provide overviews of general SLAM technologies, there is a notable lack of comprehensive reviews specifically dedicated to semantic SLAM (Hughes et al., 2024, Xia et al., 2024). Compared to other forms of SLAM, semantic SLAM has been underrepresented in the literature. To the best of our knowledge, there are limited comprehensive surveys available on semantic SLAM. Most available surveys deal with different aspects of SLAM technologies, such as visual SLAM (Li, Wang et al., 2018, Pu et al., 2023, Sahili et al., 2023, Wang, Tian et al., 2024, Zhang et al., 2021), which focus on visual data for mapping and navigation. This gap highlights the need for a dedicated survey that consolidates existing research on semantic SLAM, evaluating methodologies, effectiveness, and applications. Our paper addresses this void by performing a systematic review of the available literature on semantic SLAM and proposing future research directions, considering the advantages and limitations of current approaches. It also gives a global view of semantic SLAM, exploring how semantic data can be integrated into SLAM frameworks and its far-reaching impact on enhancing robotic perception and autonomous navigation. This section outlines the survey methodology adopted in this study. We define the inclusion and exclusion criteria to ensure that only the most relevant and high-quality articles are considered, thereby maintaining the rigor and reliability of our systematic review.

2.1. Search strategy

We begin by reviewing several key articles, mainly focused on semantic SLAM and scene understanding. To ensure comprehensive coverage of the relevant literature, we construct a carefully structured search query combining several terms associated with semantic SLAM and scene understanding: "semantic SLAM", "semantic SLAM with scene understanding", "indoor scene understanding", and "outdoor scene understanding". To enhance the search results, we additionally used author keywords such as "semantic SLAM" for scene-understanding-related papers.
To ensure a comprehensive and unbiased review, we adhered to a strict survey methodology with clearly defined inclusion and exclusion criteria as illustrated in Fig. 6. Articles were selected based on the following focused criteria:
  • Paper Inclusion Criteria
    • Review articles, proceedings, and journals
    • Articles written in English
    • Articles published between 2015 and 2025
    • Articles focusing on semantic SLAM, indoor scene understanding, and outdoor scene understanding
    • Articles with SCI, SCIE, and Conference Proceedings citation index
    • Articles focusing on research areas like robotics, computer science, and engineering
  • Paper Exclusion Criteria
    • Each article will be counted only once, even if it appears in multiple digital libraries
    • Articles that are not peer-reviewed
    • Studies that do not have experimental results
    • Papers not directly related to the primary focus of semantic SLAM and robotics
    • Duplicate studies from other sources not listed
    • Papers not listed in Q1, with exceptions for significant contributions

Fig. 6. Flowchart illustrating the systematic paper selection process employed in this survey, where duplicates were removed, followed by a screening of titles and abstracts for relevance. Articles were assessed against strict inclusion and exclusion criteria, focusing on semantic SLAM.

2.2. Paper selection results

To ensure a thorough and methodical exploration of the research landscape, a systematic literature review was conducted based on the above-mentioned criteria; the final set of selected papers is illustrated as a bar plot in Fig. 7.
Fig. 7 presents the annual distribution of the 191 Web of Science articles included in our survey, spanning from 2015 to 2025. The data reveal a clear upward trend in the number of publications related to semantic SLAM over the past decade. Initial contributions were relatively sparse between 2015 and 2018, with fewer than 10 papers published each year. However, from 2019 onward, there has been a noticeable increase, reflecting growing academic interest and technological advancement in the field. A significant rise began in 2020, with the number of publications more than doubling compared to previous years. This growth continued steadily, peaking in 2024 with 41 articles, followed closely by 37 publications in 2025. The sharp increase from 2020 onward suggests that semantic SLAM has become a rapidly evolving and highly active research area, likely driven by advancements in deep learning, 3D perception, and autonomous systems. This trend underscores the increasing relevance and applicability of semantic SLAM, motivating the need for a structured and comprehensive survey such as the one presented in this paper.

Fig. 7. Bar graph illustrating the number of publications per year in the Web of Science database from 2015 to 2025.

To enhance the breadth of our review, we also analyzed the related work sections of the initially selected articles. This allowed us to identify 123 additional papers relevant to SLAM techniques and scene understanding, bringing the total number of surveyed articles to 314. Additionally, the distribution of publication sources for the articles included in our survey is illustrated in Fig. 8. A significant majority of the reviewed works (84%) were published in peer-reviewed journals, reflecting the maturity and credibility of the research in the semantic SLAM domain. Conference papers accounted for 11%, while arXiv preprints contributed 3% and books comprised 2%. Although the proportion of arXiv articles is relatively small, those included in our study were carefully selected based on their high citation counts and demonstrated impact in the field, highlighting their relevance. After outlining the strict criteria we used to select articles, the next step is to provide the reader with the essential background of Semantic SLAM. This foundation will help contextualize the works we review later on.

Fig. 8. A pie chart illustrating the distribution of the papers reviewed in our survey, showing that 84% of them were journal articles.

3. Background and fundamentals about semantic SLAM

Here, we provide the necessary background and fundamental concepts of Semantic SLAM. This section serves to familiarize readers with the core principles, terminologies, and evolution of SLAM, establishing the foundation for the more advanced discussions that follow. Semantic SLAM augments traditional mapping and localization with high-level semantics (Cadena et al., 2016). By adding labels to 3D structure and layout, it provides contextual meaning for robots and humans. Geometry captures points, edges, and depth from sensors; semantics adds understanding beyond shape, improving handling of moving obstacles. A key benefit is real-time fusion of semantics with geometry to assign object identities, whereas classic SLAM yields accurate but semantically empty maps. The result is a precise geometric map enriched with labels (e.g., wall, chair, vehicle) that supports applications like autonomous driving, navigation, and environmental analysis (Atanasov et al., 2016, Azzam et al., 2021, Bowman et al., 2017, Choe et al., 2022, Ran, Yuan, Zhang, He et al., 2021). Modern ML and vision classify scene elements to enable intelligent interaction (Grinvald et al., 2019, Mukherjee et al., 2021, Zhang et al., 2019). Robustness has improved via spatial-layout constraints and cross-view consistency (Ji et al., 2023); object-level methods refine associations and poses to keep identities consistent (Chen, Liu et al., 2022). Sparse GP regression further scales dense metric-semantic mapping for multi-robot use (Zobeidi, Koppel, & Atanasov, 2022).
Semantic extraction uses advanced deep learning to identify objects and support mapping. Beyond dense metric maps, text-based unsupervised segmentation builds topological semantic maps for assistive navigation without extensive labels (Sun, Ma, Zhou, & Cao, 2023). Together, these pipelines yield maps with both coordinates and context, enabling better robotic decision-making (Kostavelis and Gasteratos, 2017, McCormac et al., 2017, Tian et al., 2019, Zhou, Yue et al., 2023). Understanding component interactions in SLAM clarifies semantic SLAM’s role in autonomy (Cornejo-Lupa et al., 2020, Kostavelis and Gasteratos, 2015). Fig. 9, Fig. 10 detail the workflow, highlighting how semantic labels enrich geometry to support more sophisticated navigation–environment interactions.
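The core of attaching labels to map geometry, as in SemanticFusion-style pipelines, is a recursive Bayesian update that multiplies per-frame CNN class probabilities into each map element and renormalizes. A minimal sketch, where the three-class label set and the observation probabilities are purely illustrative:

```python
import numpy as np

def fuse_labels(map_probs, frame_probs):
    """Recursive Bayesian label fusion for one map element:
    multiply the stored class distribution by the new observation, then renormalize."""
    fused = map_probs * frame_probs
    return fused / fused.sum()

# Illustrative classes: wall (0), chair (1), vehicle (2). Start from a uniform prior.
probs = np.full(3, 1.0 / 3.0)

# Two noisy per-frame CNN observations that both favor "chair".
for obs in [np.array([0.2, 0.6, 0.2]), np.array([0.25, 0.55, 0.2])]:
    probs = fuse_labels(probs, obs)

label = int(np.argmax(probs))   # the map element's current semantic label
```

Repeated observations sharpen the distribution over time, which is why fused map labels are typically more stable than any single-frame segmentation.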
Adding semantics to SLAM enables object and feature recognition, so systems map the environment and interpret object purpose and context. Work is closing the gap to higher-level cognition, allowing mapping with real-time semantic recognition and categorization. The following sections survey emerging integration strategies that drive semantic SLAM forward.

Fig. 9. Integration of processes, sensors, and algorithms in SLAM systems.

Reviews often organize semantic SLAM by algorithmic style (feature-based, direct, dense), semantic integration (object- vs. scene-level), or application (indoor AR, driving, robotics). Many also contrast vision and LiDAR, or static scenes with semantic overlay versus dynamic-object SLAM. Here we adopt a more systematic view based on sensor modality and operating environment.
Organization by sensor type: classifying methods by sensor (monocular, stereo, RGB-D, LiDAR, or fused) aligns them with their hardware capabilities and limits. Sensor choice directly affects map representation and tracking accuracy (Chen, Xiao et al., 2025). Distinguishing single-sensor from fusion setups (e.g., visual–inertial, vision–LiDAR) reflects common practice for robustness. Grouping by sensor enables fair comparisons under the same information regime (e.g., stereo-based dynamic SLAM systems).
Separating indoor/outdoor (and structured/unstructured) is also useful. Semantics and motion statistics differ: indoors feature walls, furniture, and people; outdoors include vehicles, pedestrians, and open spaces. Conditions like lighting, texture, and weather strongly affect performance. Example priors differ (planar ground outdoors; ceiling planes indoors). This split enables environment-wide comparisons (e.g., indoor RGB-D vs. outdoor stereo+LiDAR).
Combining these axes yields a taxonomy such as: monocular (with/without IMU) – indoor, stereo – outdoor, RGB-D – indoor, LiDAR – outdoor, and their sensor-fusion variants. Categorizing by sensor and usage scenario clarifies trade-offs: for example, monocular SLAM forgoes depth sensors but can exploit scale priors and works well in feature-rich scenes, whereas LiDAR systems excel outdoors but lack the RGB appearance cues most semantic pipelines rely on. Organizing a survey by sensor and environment also narrows the field: it clusters methods with similar operating conditions, so comparisons of strengths and weaknesses are meaningful. For instance, comparing two RGB-D indoor SLAM systems can highlight their semantic fusion schemes, whereas comparing an RGB-D indoor approach with a stereo–LiDAR outdoor approach would conflate entirely different conditions.

3.1. Sensors

Semantic SLAM systems can be categorized based on their primary sensing modalities, each offering distinct advantages for different applications and environments. Table 1 presents a comparison of various sensor-based approaches evaluated in this survey. The following subsections examine six major categories: monocular semantic SLAM, stereo semantic SLAM, RGB-D semantic SLAM, LiDAR semantic SLAM, multi-modal semantic SLAM, and incremental semantic SLAM.

Table 1. Advantages and disadvantages of different semantic SLAM sensor approaches.

Type: Monocular semantic SLAM
  Advantages: Low cost; lightweight; high portability
  Disadvantages: Scale ambiguity; limited depth perception; vulnerable in dynamic scenes

Type: Stereo semantic SLAM
  Advantages: Improved depth accuracy; enhanced environmental understanding; robustness
  Disadvantages: Higher computational cost; increased hardware complexity; higher cost

Type: RGB-D semantic SLAM
  Advantages: Rich sensory data; ease of scene reconstruction; handles dynamic environments well
  Disadvantages: Limited range; sensitivity to lighting conditions; higher energy consumption

Type: 3D LiDAR-based semantic SLAM
  Advantages: High accuracy; robust to lighting conditions; effective in large-scale environments
  Disadvantages: High cost; complexity; limited by weather conditions

Type: Multi-modal semantic SLAM
  Advantages: Comprehensive understanding; robustness; improved accuracy
  Disadvantages: High computational cost; increased system complexity; costly

Type: Incremental semantic SLAM
  Advantages: Continuous mapping; adaptability; resource efficiency
  Disadvantages: Complex algorithm design; drift; limited scalability

3.1.1. Monocular semantic SLAM

Monocular semantic SLAM extends traditional SLAM by adding object recognition with a single camera, lowering hardware cost and computation for small-scale devices (Cadena et al., 2016). Its major drawback is scale ambiguity, and it degrades in low-texture scenes and with moving objects (Engel et al., 2014, Gao et al., 2024). Whereas classic SLAM maps geometric features (points, lines), the semantic variant augments maps with labels from recognized objects, enabling richer interaction with the environment. The pipeline estimates camera pose and a sparse geometric map, builds 3D object models from images, recognizes objects, and fuses their positions into the SLAM map with continual refinement. Two parallel threads operate: a monocular SLAM thread for pose and geometry, and an object-recognition thread that detects objects in view and estimates their poses; recognized objects are inserted and their 3D poses are updated as more frames arrive (Han and Yang, 2023, Sun et al., 2021). Recent advances strengthen data association with ensemble methods (Wu et al., 2020) and deep shape priors for accurate object modeling (Wang, Runz and Agapito, 2021); complementary work uses spatiotemporal consistency and graph constraints to filter out incorrect detections (Zhang, Yuan, Ran, Tao, & Wu, 2023). Geometric contour-based alignment further improves tracking under varying lighting by matching projected object boundaries across viewpoints (Lin, Wang, Xu, Zhao, & Chen, 2023). Building on these, newer frameworks introduce outlier-robust modeling via isolation forests (iForest) and semantic topological mapping, enabling higher-level tasks such as object-driven exploration and manipulation (Wu et al., 2023).
Building on the general pipeline in Fig. 10, Fig. 11 presents two tightly coupled parallel threads: monocular SLAM and object recognition. The monocular SLAM thread uses an EKF to estimate camera motion and map features, employing a 1-point RANSAC-EKF for resilient data association. Upon object recognition, the SLAM-derived camera pose (position and orientation) is leveraged to augment the estimate. Detected objects are transformed from their own reference frame into the SLAM frame and inserted into the map, where their 3D poses are iteratively refined by the SLAM back-end.
In the object recognition thread, SURF features are extracted from the images, and their correspondences are computed against known object models using Nearest Neighbor Distance Ratio (NNDR). Afterward, RANSAC performs a geometric consistency check to identify valid transformations. For planar objects, a homography is estimated, solving the Perspective-n-Point problem to estimate the object’s translation and orientation. The object’s pose is then refined using inlier correspondences and incorporated into the SLAM system, enhancing the geometric map with semantic information and enabling real-time SLAM.
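The NNDR acceptance test used in the matching step above can be sketched in a few lines: a query descriptor is matched only when its nearest model descriptor is clearly closer than the second nearest. This is a simplified sketch with toy 2-D descriptors standing in for 64-D SURF vectors:

```python
import numpy as np

def nndr_matches(desc_query, desc_model, ratio=0.8):
    """Nearest Neighbor Distance Ratio test: accept a match only when the
    closest model descriptor beats the second closest by the given ratio."""
    matches = []
    for i, d in enumerate(desc_query):
        dists = np.linalg.norm(desc_model - d, axis=1)  # distance to every model descriptor
        nn = np.argsort(dists)[:2]                      # two nearest neighbors
        if dists[nn[0]] < ratio * dists[nn[1]]:
            matches.append((i, int(nn[0])))
    return matches

# Toy descriptors: query 0 matches model 1 unambiguously; query 1 sits between
# two near-identical model candidates and is rejected as ambiguous.
model = np.array([[0.0, 0.0], [10.0, 10.0], [5.0, 5.0], [5.1, 5.0]])
query = np.array([[10.0, 10.1], [5.05, 5.0]])
matches = nndr_matches(query, model)
```

In the full pipeline, the surviving matches would then be passed to RANSAC for the geometric consistency check described above.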
A key component of this process is the equation that transforms the object’s position and orientation from the object frame to the camera frame, as shown in (1): $$H_{C_{km}O} = H_{C_{km}F}\left(t_{C_{km}F},\, q_{C_{km}F}\right)\, H_{FO}\left(t_{OF},\, q_{OF}\right) \tag{1}$$ where $t_{OF}$ and $q_{OF}$ represent the position and orientation of face $F$ in the object frame $O$, while $t_{C_{km}F}$ and $q_{C_{km}F}$ denote the position and orientation of face $F$ relative to the SLAM camera $C_{km}$. This transformation integrates the object’s features into the SLAM map by converting the object coordinates into the camera’s coordinate frame.
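The frame change in (1) is a composition of rigid-body transforms, each built from a translation and a unit quaternion. A minimal numeric sketch, with illustrative pure-translation values rather than data from the cited system:

```python
import numpy as np

def quat_to_rot(q):
    """Unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def homogeneous(t, q):
    """Build a 4x4 rigid-body transform H from translation t and unit quaternion q."""
    H = np.eye(4)
    H[:3, :3] = quat_to_rot(q)
    H[:3, 3] = t
    return H

# Chain camera<-face and face<-object, mirroring the structure of Eq. (1).
H_cam_face = homogeneous([0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0])  # identity rotation
H_face_obj = homogeneous([0.5, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
H_cam_obj = H_cam_face @ H_face_obj

p_obj = np.array([0.0, 0.0, 0.0, 1.0])   # object origin in homogeneous coordinates
p_cam = H_cam_obj @ p_obj                # the same point expressed in the camera frame
```

With identity rotations, the composed transform simply accumulates the two translations, which makes the chaining easy to verify by hand.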

Fig. 11. Monocular semantic SLAM parallel threads.

The SLAM state vector \(x_k\) is then updated to include new object points. The state vector includes the camera state \(x_{C_k}\), the existing map points \(y_1, \ldots, y_n\), and the newly detected object points \(y_{WF}\): (2) \( x_k = \begin{bmatrix} x_{C_k} & y_1 & \cdots & y_n & y_{WF} \end{bmatrix}^{\top} \).
By (2), object features enter the SLAM state via camera-frame coordinates, adding object points that EKF monocular SLAM can continually refine. Relocalization is also pivotal (Lee, Back, Hwang, & Chun, 2023a): when tracking fails—e.g., due to dynamics, dropped frames, or blur—the system realigns to a prebuilt map by matching current observations to known features, restoring robustness across varied settings (Chen et al., 2015, Civera et al., 2011, Li, Fu et al., 2025).
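The state augmentation of (2) can be sketched as follows. The zero cross-covariance initialization and the toy 3-DoF camera state are simplifying assumptions of ours, not the full EKF bookkeeping of the cited systems:

```python
import numpy as np

def augment_state(x, P, y_new, R_new):
    """Append newly initialized object points y_new to the EKF state x,
    growing the covariance P with their initial uncertainty R_new.
    Cross-covariances are set to zero in this simplified sketch."""
    x_aug = np.concatenate([x, y_new])
    n, m = len(x), len(y_new)
    P_aug = np.zeros((n + m, n + m))
    P_aug[:n, :n] = P        # existing camera/map block is preserved
    P_aug[n:, n:] = R_new    # new object points get their own uncertainty
    return x_aug, P_aug

x = np.array([0.0, 0.0, 1.0])      # toy camera state
P = 0.01 * np.eye(3)
y_wf = np.array([2.0, 1.5, 0.3])   # newly detected object point in the world frame
x_aug, P_aug = augment_state(x, P, y_wf, 0.1 * np.eye(3))
```

Subsequent EKF updates would then refine both blocks jointly as new observations of the object arrive.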

3.1.2. Stereo semantic SLAM

Stereo semantic SLAM is a leading paradigm that fuses stereo imagery with semantic cues for localization and mapping. It excels in dynamic scenes, delivers stronger depth estimates than monocular systems, and is more tolerant of illumination changes (Mur-Artal et al., 2015). The trade-off is higher compute and storage for stereo keyframes, which support optimization, loop closure, and relocalization, raising system cost (Meilland & Comport, 2013). To mitigate the “static-world” assumption of classic SLAM, stereo semantic methods employ self-instance segmentation and dynamic feature filtering (Bajpai et al., 2016, Li, Song et al., 2023), often building atop the well-established ORB-SLAM2 (Hu, Qi et al., 2025, Mur-Artal and Tardos, 2017, Zhai et al., 2024).
Concretely, stereo pairs are converted to depth, and parallel threads couple dynamic-feature filtering with ORB-SLAM2 to remain robust in motion-rich scenes. ORB features are extracted and their motion states inferred to separate static from dynamic points. Instance segmentation then contributes class-level semantics, partitioning frames into static and potentially dynamic regions; combining this with motion cues filters dynamic features. As in Fig. 12, four threads run concurrently: (1) pose tracking using static points, (2) local mapping, (3) loop detection/correction, and (4) dynamic-region recognition for filtering. This multi-threaded design preserves high-accuracy tracking and mapping despite moving objects (Bajpai et al., 2016, Li, Song et al., 2023).
Several key formulas and equations underpin stereo semantic SLAM. Absolute pose estimation and essential matrix estimation are crucial for inferring the relative baseline motion between frames (Ai et al., 2023a), defined as shown in (3): \( x'^{\top} F x = 0 \), where \(x\) and \(x'\) are corresponding points in the stereo images. This epipolar constraint, through which a motion probability \(p\) is calculated for each object given a camera pose \(P\) and fundamental matrix \(F\), allows static objects to be distinguished from dynamic ones. The camera pose \(T\) is then estimated using only static feature points with (4): \( T = \arg\min_{T} \sum_i \| x_i - P X_i \|^2 \), where \(X_i\) refers to the 3D point and \(x_i\) is its corresponding 2D point in the image (Tian, Yan et al., 2023, Venator et al., 2020, Zhou and Wang, 2025).
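A minimal numpy sketch of the epipolar test in (3): matches consistent with the camera motion satisfy the constraint, while independently moving points violate it. The fundamental matrix below (a toy model of a pure horizontal shift) and all point values are our own illustrative assumptions:

```python
import numpy as np

def epipolar_residual(F, x1, x2):
    """Residual of the epipolar constraint x2^T F x1 = 0 for homogeneous points."""
    return float(x2 @ F @ x1)

# Toy fundamental matrix for a pure translation along x: F = [t]_x with t = (1, 0, 0).
F = np.array([[0.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])

x1 = np.array([0.2, 0.3, 1.0])
x2 = np.array([0.5, 0.3, 1.0])       # same row: consistent with a horizontal shift
moving = np.array([0.5, 0.6, 1.0])   # vertical displacement violates the constraint

static_ok = abs(epipolar_residual(F, x1, x2)) < 1e-6
dynamic = abs(epipolar_residual(F, x1, moving)) > 0.1
```

In a full system the residual would be thresholded per feature to decide which points feed the pose solver of (4).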

Fig. 12. Stereo semantic SLAM parallel threads.

3.1.3. RGB-D semantic SLAM

RGB-D semantic SLAM integrates color and depth information to enhance environmental understanding, achieve more accurate 3D mapping, and improve adaptability to scene changes. RGB-D sensors typically have a short effective range and experience reduced performance in bright sunlight; they also consume more power than simple cameras. RGB-D semantic SLAM is an advanced robotics method used to create smart maps by combining visual sensing with semantic information. Unlike traditional SLAM, which maps the environment based on shapes and appearances, RGB-D semantic SLAM labels these maps descriptively with terms such as “chair”, “table”, or “wall”. This enhances the robot’s ability to recognize and locate objects within the environment (Arth et al., 2015, Ji et al., 2021, Wang, Luo et al., 2025). Maintaining semantic coherence in dynamic scenes is tackled by combining spatiotemporal consistency with probabilistic propagation, enabling reliable separation of static and moving objects while preserving map integrity (Chen, Ling, Gao, Sun, & Jin, 2023). Online systems go further by coupling 2D/3D detection with semantic landmark association, providing real-time updates and constraints that correct pose drift during SLAM (Hempel & Al-Hamadi, 2022). For exploration, information-theoretic planners use Bayesian multiclass octrees with Shannon mutual information to choose viewpoints that reduce both geometric and semantic uncertainty (Asgharivaskasi & Atanasov, 2023). These ideas translate to factory floors via lightweight segmenters and dynamic keypoint classifiers tuned for industrial complexity (Gou et al., 2022).
Semantics improve a robot’s task efficiency and interaction. Typical RGB-D semantic SLAM stacks three elements: (1) a dense SLAM core such as ElasticFusion (Memon et al., 2024, Whelan et al., 2016); (2) CNN-based semantic segmentation over RGB; and (3) Bayesian fusion of labels into the 3D map. ElasticFusion maintains a surfel-based model resilient to revisits; surfels are adjusted to remain consistent with the real scene. A CNN performs pixel-wise labeling via max-unpooling and deconvolution to produce class probabilities. As depicted in Fig. 13, threads run in parallel: SLAM tracks pose and integrates depth to update surfels’ geometry/color, while CNN predictions are registered via SLAM correspondences and fused probabilistically. Loop closure detects revisits and triggers global geometric optimization, changing surfel positions, normals, and semantic distributions. CRFs further regularize labels over surfels by enforcing spatial/appearance consistency, yielding rich, long-term consistent maps for advanced navigation (Li, Fan et al., 2023, Qin et al., 2022). Each surfel stores a class probability vector updated recursively, as in (5): \( P(l_i \mid I_{1,\ldots,k}) = \frac{1}{Z}\, P(l_i \mid I_{1,\ldots,k-1})\, P(O_{u(s,k)} = l_i \mid I_k) \), where \( P(l_i \mid I_{1,\ldots,k}) \) represents the updated probability of surfel \(s\) belonging to class \(l_i\) given images \(I_1\) to \(I_k\); \( P(O_{u(s,k)} = l_i \mid I_k) \) represents the CNN’s per-pixel probability output for the current image \(I_k\); and \(Z\) is a normalizing constant.
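The recursive update in (5) amounts to an element-wise product followed by normalization. A minimal sketch with invented class probabilities:

```python
import numpy as np

def fuse_labels(prior, cnn_probs):
    """Recursive Bayesian update of a surfel's class distribution (Eq. (5)):
    multiply the running distribution by the CNN output and renormalize (division by Z)."""
    posterior = prior * cnn_probs
    return posterior / posterior.sum()

# Three toy classes, e.g. (chair, table, wall); uniform prior.
p = np.full(3, 1.0 / 3.0)
for frame_probs in [np.array([0.7, 0.2, 0.1]),
                    np.array([0.6, 0.3, 0.1]),
                    np.array([0.8, 0.1, 0.1])]:
    p = fuse_labels(p, frame_probs)

# Repeated agreement across frames sharpens the distribution toward class 0.
```

This is exactly why fused maps become more confident than any single-frame segmentation: independent, consistent evidence compounds multiplicatively.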
The semantic predictions are further refined by incorporating the map’s geometry using Conditional Random Fields (CRFs). This method ensures that all labels remain consistent with the surrounding context. The energy of a labeling in a fully connected graph can be expressed as in (6): \( E(x) = \sum_{s} \psi_u(x_s) + \sum_{s < s'} \psi_p(x_s, x_{s'}) \), where the unary term \(\psi_u(x_s)\) is defined as the negative logarithm of the surfel’s internal probability distribution, as in (7): \( \psi_u(x_s) = -\log\left( P(L_s = x_s \mid I_{1,\ldots,k}) \right) \).
The pairwise term \(\psi_p(x_s, x_{s'})\) uses Gaussian edge potentials to enforce smooth predictions based on positional and appearance similarities, as in (8): \( \psi_p(x_s, x_{s'}) = \mu(x_s, x_{s'}) \sum_{m=1}^{K} w^{(m)} k^{(m)}(f_s, f_{s'}) \), where the \(k^{(m)}\) are Gaussian kernels applied to the feature vectors \(f_s\) and \(f_{s'}\) of surfels \(s\) and \(s'\), and the compatibility \(\mu(x_s, x_{s'})\) is defined by the Potts model.
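The terms of (6) through (8) can be evaluated directly. The sketch below uses a single Gaussian kernel (K = 1) and toy features of our choosing to show that label-consistent assignments of nearby, similar surfels receive lower energy:

```python
import numpy as np

def unary(prob):
    """Unary term (Eq. (7)): negative log of the surfel's class probability."""
    return -np.log(prob)

def pairwise(label_s, label_t, f_s, f_t, w=1.0, theta=1.0):
    """Pairwise term (Eq. (8)) with one Gaussian kernel over feature vectors
    and a Potts compatibility: mu = 1 iff the labels differ, else 0."""
    mu = 0.0 if label_s == label_t else 1.0
    k = np.exp(-np.sum((f_s - f_t) ** 2) / (2 * theta ** 2))
    return mu * w * k

# Two nearby surfels with nearly identical features; both favor class 0.
f1, f2 = np.array([0.0, 0.0, 0.10]), np.array([0.0, 0.0, 0.12])
probs = {0: 0.9, 1: 0.1}

def energy(l1, l2):
    return unary(probs[l1]) + unary(probs[l2]) + pairwise(l1, l2, f1, f2)

e_same = energy(0, 0)   # consistent labeling: no Potts penalty
e_mixed = energy(0, 1)  # disagreeing labels on similar surfels: penalized
```

Minimizing this energy over all surfels is what smooths isolated mislabels out of the map.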
RGB-D semantic SLAM as implemented in the SemanticFusion framework combines mapping with object recognition. Augmenting the ElasticFusion framework with CNN predictions and Bayesian updates can substantially boost the system. This integration allows for more accurate labeling in both 2D and 3D, enabling robots to become smarter and more efficient in their interactions and functions (Kostavelis and Gasteratos, 2017, Li, Gu et al., 2020, Li, Hu et al., 2021, McCormac et al., 2017, Tian et al., 2019, Zhou, Yue et al., 2023).

3.1.4. 3D LiDAR-based semantic SLAM

3D LiDAR-based semantic SLAM incorporates semantic information into conventional SLAM procedures, providing accurate depth measurements and detailed environmental information. It performs well in both dark and bright environments (Gong et al., 2021, Liu, Mi et al., 2021, Yang, Chen et al., 2023). This approach has been successfully extended to complex indoor environments, where robot-assisted mobile scanning systems combine LiDAR-based SLAM with deep learning semantic segmentation to achieve comprehensive 3D reconstruction and automated point cloud labeling of building interiors (Hu, Gan, & Yin, 2023). Furthermore, 3D LiDAR-based SLAM is also particularly well-suited for outdoor and large-area mapping applications (Behley et al., 2019, Li, Zhang et al., 2020, Pu et al., 2025, Qiu et al., 2019, Ruiz-Sarmiento et al., 2017, Tang, Huang et al., 2023, Zhu, Yuan et al., 2025). Specialized applications have emerged for challenging natural environments, such as garden mapping systems that use semantic-based filtering to distinguish static structures from dynamic vegetation. This enables accurate static map construction through multi-frame and multi-resolution fusion techniques (Han et al., 2023). Recent advances have enhanced outdoor mapping through semantic-assisted topological map building, where online semantic segmentation enables adaptive node selection and robust place recognition in large-scale environments (He, Zhang, & Zhuang, 2022). However, LiDAR sensors are expensive and require sophisticated algorithms for data processing, with performance potentially hindered by weather conditions like rain or fog (Pak and Son, 2025a, Wei et al., 2024, Zhang and Singh, 2014). The general components of a 3D LiDAR-based semantic SLAM system, including tracking, mapping, and loop closing, are executed in parallel threads, as illustrated in Fig. 14.
These parallel processes work together to build and maintain a semantic map of the environment. These include (1) feature extraction, which identifies key features from raw LiDAR data crucial to understanding the environment’s structure, and (2) feature matching and data association, which correlate these features against a map or between frames to determine the robot’s position relative to its surroundings.
While these processes are running, semantic segmentation, powered by deep learning, classifies environmental components in the point cloud data. Object detection and tracking, which are parallel processes, identify and monitor objects over time. Pose estimation computes the exact position and orientation of the sensor, ensuring accurate localization. The system continuously updates and manages the map by incorporating new data, removing outdated information, and resolving inconsistencies. Optimization algorithms run in parallel, refining the map and trajectory quality by integrating new and historical data to minimize errors.
These processes are computationally intensive, making efficient parallel execution essential for real-time data processing, which is critical for autonomous vehicles and robotic navigation in complex settings (Lou, Li, Zhang, & Wei, 2023). SLAM-generated maps are enriched with semantic information, significantly improving environmental understanding and navigation capabilities, particularly for applications such as autonomous driving. In semantic SLAM, point cloud data, conventionally denoted as \( P = \{ p_i \} \), where every \(p_i\) is a 3D point, is processed to include semantic labels \(l_i\). These labels classify each point into various classes based on its characteristics through a classification model. Since these semantically rich point clouds may originate from different viewpoints, they are aligned and integrated into a coherent map using transformation matrices \(T\). This process is evaluated for effectiveness using the mean Intersection-over-Union (mIoU), as defined in (9) and used in the SemanticKITTI dataset: \( \mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c} \).
Here, C is the number of classes under consideration, and for any class, c, TPc, FPc, and FNc are the numbers of true positives, false positives, and false negatives, respectively. True positives are the points correctly classified as class c; false positives are those incorrectly labeled as class c despite belonging to another class; and false negatives are points that belong to class c but were mistakenly classified as a different class. This metric averages the IoU across all classes, giving a holistic view of the model’s detection and classification accuracy. For example, in autonomous driving applications such as lane segmentation, it is crucial to have a precise and reliable understanding of the road environment, making these evaluations important for ensuring operational safety and effectiveness (Behley et al., 2019, Li, Zhang et al., 2020, Pugh et al., 2023, Qiu et al., 2019, Ruiz-Sarmiento et al., 2017, Tang, Huang et al., 2023).
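Equation (9) translates directly into code; the per-point labels below are invented for illustration:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union (Eq. (9)) over per-point class labels."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))   # correctly labeled as class c
        fp = np.sum((pred == c) & (gt != c))   # wrongly labeled as class c
        fn = np.sum((pred != c) & (gt == c))   # class-c points labeled otherwise
        denom = tp + fp + fn
        if denom > 0:                          # skip classes absent from both
            ious.append(tp / denom)
    return float(np.mean(ious))

gt   = np.array([0, 0, 1, 1, 2, 2])   # ground-truth labels for six toy points
pred = np.array([0, 0, 1, 2, 2, 2])   # one point of class 1 mislabeled as 2
miou = mean_iou(pred, gt, 3)
```

Here the per-class IoUs are 1.0, 0.5, and 2/3, so the mean penalizes the single mislabel in two classes at once, which is why mIoU is preferred over raw accuracy for imbalanced scenes.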

Fig. 14. 3D LiDAR-based semantic SLAM parallel threads.

3.1.5. Multi-modal semantic SLAM

Multi-modal semantic SLAM merges complementary sensors with semantics within a SLAM framework, enriching maps with object identity and attributes. Contemporary reviews emphasize the central role of deep learning in fusing heterogeneous sensors for robust autonomy (Tang, Zhao et al., 2023). Multiple modalities boost resilience to failures and noise but impose higher compute, tighter synchronization, and added hardware cost (Bresson et al., 2017, Rosen et al., 2021). Even so, the payoff is stronger navigation and interaction in complex settings (Chen et al., 2024, Chghaf et al., 2022, Xiao et al., 2025).
The system’s performance is further improved by parallel processing, as shown in Fig. 15, which enables efficient handling of large-scale maps and complicated environments. Parallel SLAM algorithms distribute computation and memory load across multiple processors using out-of-core techniques. The factor graph is subdivided into sub-graphs, with local optimization for each segment while globally refining the whole graph. Selective updates are performed at the highest hierarchy level whenever new observations are made, concentrating computational resources on areas that change.
Additionally, some systems use a dual-thread strategy to separate the localization and mapping tasks. One thread optimizes and manages the local pose-feature graph, while another handles the pose-pose graph, with periodic synchronization. This separation ensures efficient task execution, which is important for real-time performance in robotic navigation and task execution in dynamic environments (Cadena et al., 2016). Mathematically, SLAM can be expressed as a MAP estimation problem, seeking to maximize the posterior probability of a map \(M\) and the robot’s trajectory \(x\), given measurements \(z\) and controls \(u\), by maximizing (10): \( (M^{*}, x^{*}) = \arg\max_{M,x}\, p(M, x \mid z_{1:t}, u_{1:t}) \).
Viewed probabilistically, SLAM addresses a high-dimensional Bayesian filtering problem: predict with the motion model, then correct with measurements (Hardegger et al., 2016, Li, Jiang et al., 2025, Yang, Zhang et al., 2023, Zhou, Mei et al., 2023). Extensions now capture epistemic uncertainty in semantics, informing belief-space planning that accounts for both pose and label uncertainty (Tchuiev & Indelman, 2023). For object-level data association, Dirichlet process mixtures cluster detections by class/position/size without fixing the number of objects (Wei, Chen, Chi, Wang and Sun, 2023). Confidence-aware fusion has likewise improved crowdsourced semantic maps, weighting pixel-level reliability across heterogeneous sources (Wijaya et al., 2022). Beyond VB inference, factor-graph methods jointly optimize coupled problems such as trajectory estimation and auto-calibration within a unified system (Liu et al., 2023).

Fig. 15. Multi-modal semantic SLAM parallel threads.

SLAM problems can also be formulated in terms of factor graphs, where nodes represent states and edges represent observational and motion constraints. The objective in factor graphs is typically to minimize the sum of squared differences between predicted measurements and actual observations, as shown in (11): \( \min_{x} \sum_i \left\| h_i(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) - z_i \right\|^2_{\Sigma_i} \), where \(h_i\) represents the measurement functions that link the observed data \(z_i\) to the robot’s states, accounting for the associated uncertainty \(\Sigma_i\). In multi-modal semantic SLAM, these models are further extended to include semantic tags and data from multiple sensors, enhancing the accuracy and richness of the robotic mapping and navigation.
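Because the measurement functions \(h_i\) are linear in the toy problem below, the weighted least-squares objective of (11) can be solved exactly in closed form. The 1-D pose chain, weights, and measurements are our own illustrative choices, not from any cited system:

```python
import numpy as np

# Toy 1-D pose graph for Eq. (11): three poses x0, x1, x2 with a prior on x0,
# odometry factors between consecutive poses, and a loop closure measuring x2 - x0.
A = np.array([
    [ 1.0, 0.0, 0.0],   # prior anchoring x0 at 0
    [-1.0, 1.0, 0.0],   # odometry factor: x1 - x0
    [ 0.0,-1.0, 1.0],   # odometry factor: x2 - x1
    [-1.0, 0.0, 1.0],   # loop closure: x2 - x0
])
z = np.array([0.0, 1.0, 1.0, 2.1])     # measurements (loop closure disagrees slightly)
w = np.array([100.0, 1.0, 1.0, 10.0])  # information weights (1 / Sigma_i)

# Weighted least squares: minimize sum_i w_i * (A_i x - z_i)^2.
W_sqrt = np.diag(np.sqrt(w))
x, *_ = np.linalg.lstsq(W_sqrt @ A, W_sqrt @ z, rcond=None)
```

The optimizer spreads the 0.1 m loop-closure disagreement across the two odometry factors in proportion to their weights, which is precisely the error-distribution behavior graph-based SLAM relies on.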

3.1.6. Incremental semantic SLAM

Incremental semantic SLAM, a method that uses Visual–Inertial Odometry (VIO), ensures real-time map updates, adapts to dynamic environments, and reduces computational overhead by processing only new or changed data (Cao et al., 2025, Engel et al., 2017, Liu, Wu et al., 2024). While the advantages of incremental SLAM are clear, it faces issues such as accumulating inaccuracies over time, requiring sophisticated methods for effective implementation, and struggling with large-scale environments without regular global corrections (Bloesch, Omari, Hutter, & Siegwart, 2015). Recent frameworks have evolved to incorporate deep learning components that enhance performance in dynamic environments, as in systems like SRVIO (Samadzadeh & Nickabadi, 2023). These methods achieve state-of-the-art results by intelligently fusing geometric constraints with semantic understanding to handle challenging scenarios that would cause traditional methods to fail.
The incremental SLAM process involves the continuous construction, updating, and refinement of a 3D mesh structure of the environment, relying on structural regularities for both mesh creation and state estimation. It is supported by parallel processes, which are crucial for simultaneously creating and updating a semantic map. These processes, shown in Fig. 16, handle the extraction of semantic features from sensory data, detecting objects like doors and furniture using techniques such as deep learning. Meanwhile, key SLAM processes update both the map and the agent’s location as new data is received.
Data association correlates new observations with the existing map, ensuring accuracy and consistency. State estimation uses filtering techniques, such as Kalman or particle filters, to compute variables like position and orientation. Additionally, 3D mesh generation from VIO keypoints creates a representation of the environment that is regularly updated and refined using structural constraints to ensure geometric accuracy. These processes—such as constraint enforcement, which applies structural regularities such as planarity, and optimization techniques to minimize errors in map and trajectory estimates—run simultaneously. They interact to refine both the semantic map and state estimation, providing a robust and accurate environmental representation critical for autonomous navigation in complex environments (Tian et al., 2022).
Previous studies have focused on optimizing these processes by using detected structural regularities, such as planarity, to improve the accuracy and physical realism of the mesh. For every keyframe \(i\), the state variables are represented as shown in (12): \( \xi_i = [R_i, p_i, v_i, b_i] \), where \(R_i\) denotes the orientation of the IMU, \(p_i\) is the position, \(v_i\) indicates the velocity, and \(b_i\) represents the biases of the IMU.
The factor graph representing the back-end optimization problem includes the state \(X_t\) and the measurements \(Z_t\), which are probabilistically interconnected as shown in (13): \( p(X_t \mid Z_t) \propto p(X_t)\, p(Z_t \mid X_t) = \phi_0(x_0) \prod_{l_c \in \Lambda_t} \prod_{\pi \in \Pi_t} \phi_R(\rho_{l_c}, \pi)^{\delta(l_c, \pi)} \prod_{(i,j) \in K_t} \phi_{IMU}(x_i, x_j) \), where \(\phi_0\) represents the prior on the initial state, the \(\phi_R\) are the regularity factors that link the points of interest \(\rho_{l_c}\) with planes \(\pi\), and \(\phi_{IMU}\) incorporates the IMU data between keyframes.

Fig. 16. Incremental semantic SLAM parallel threads.

The MAP estimation, which aims to find the most probable state configuration based on the measurements, is formulated in (14): \( X_t^{MAP} = \arg\min_{X_t} \| r_0 \|^2_{\Sigma_0} + \sum_{l_c \in \Lambda_t} \sum_{\pi \in \Pi_t} \delta(l_c, \pi)\, \| r_R \|^2_{\Sigma_R} + \cdots \).
This estimation aims to minimize the sum of squared residuals, where each term addresses a different aspect of the system’s behavior and its interaction with the environment. The regularity constraints, particularly those related to co-planarity, are stated in (15): \( r_R = n^{\top} \rho_{l_c} - d \).
The constraint ensures that the landmark ρlc lies on the plane specified by the normal vector n and the distance d from the origin, enhancing the structural integrity of the generated mesh relative to real-world geometry (Rosinol, Sattler et al., 2019, Wen et al., 2020). This advanced approach not only improves the precision of state estimation but also produces a 3D mesh that is both accurate and representative of the actual environment, particularly in structured environments (Guo and Fan, 2022, Rosinol et al., 2020, Wu et al., 2025). Recent multi-robot extensions of these planar constraint-based methods show that high-quality metric-semantic reconstruction can be maintained across distributed systems (Tian et al., 2022).
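The residual of (15) is simply a signed point-to-plane distance. A minimal sketch with an invented floor plane:

```python
import numpy as np

def planar_residual(landmark, n, d):
    """Co-planarity residual of Eq. (15): signed distance of a landmark
    from the plane n . x = d, where n is a unit normal."""
    return float(n @ landmark) - d

n = np.array([0.0, 0.0, 1.0])   # horizontal plane (e.g. a floor) at height z = 0.5
d = 0.5
on_plane  = np.array([1.0, 2.0, 0.5])
off_plane = np.array([1.0, 2.0, 0.8])

r_on = planar_residual(on_plane, n, d)    # ~0: constraint satisfied
r_off = planar_residual(off_plane, n, d)  # 0.3: penalized by the MAP optimizer of (14)
```

During optimization, squaring and weighting this residual by \(\Sigma_R\) pulls mesh vertices detected as co-planar back onto their shared plane.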
Sensor fusion involves integrating data from LiDAR, cameras, and Inertial Measurement Units (IMUs) to improve accuracy and robustness in SLAM. This integration uses the strengths of each sensor type to alleviate their respective weaknesses, providing more reliable 3D mapping in complex and dynamic environments. In sensor fusion, the system state estimates, namely rotation, velocity, and position, are adjusted according to formulas that describe how sensor data are fused over time. For rotation updates, the equation is given as (16): \( \Delta R = \exp(\Omega\, \Delta t) \), where \(\Omega\) is the skew-symmetric matrix of the angular velocity \(\omega(t)\), and \(\Delta t\) is the time interval. Eq. (16) is important for updating the orientation of the sensor platform based on angular velocity measurements, transforming these readings into a rotation matrix that reflects the changes in orientation over the time interval.
The velocity update, as shown in (17): \( \Delta v = R(t_0)\, a(t)\, \Delta t \), uses the initial rotation matrix \(R(t_0)\) and the linear acceleration \(a(t)\) to compute changes in the sensor platform’s velocity over time. The initial orientation at the start of the time period is accounted for, allowing the acceleration to be applied properly within the global reference frame, ensuring accurate velocity updates as the sensor platform moves.
Finally, the position is updated using (18): \( \Delta p = \Delta v\, \Delta t + \frac{1}{2} a(t) (\Delta t)^2 \). This equation integrates the change in velocity over time and adds the displacement due to constant acceleration, providing a full update of the platform’s position. In dynamic environments, sequential updates of rotation, velocity, and position are important for maintaining SLAM accuracy. The system rapidly adapts to new sensor inputs, reducing errors due to sensor noise or processing delays. These updates not only improve real-time operational capabilities in SLAM but also contribute to the overall system’s reliability and performance in increasingly complex and challenging conditions (Cai et al., 2024, Yu et al., 2022). Recent implementations have shown that incorporating semantic information during these incremental updates, such as through YOLOv4-based object detection, can further enhance localization accuracy. The combination of object detection with clustering algorithms provides semantic constraints that complement the geometric updates, resulting in more robust localization performance (Chai, Li, & Li, 2023). Beyond traditional object detection, visual–LiDAR fusion methods have demonstrated that salient obstacle detection can transform environmental features into reliable landmarks for mapping. Centroid and contour extraction from fused sensor data provides distinctive reference points for localization (Hu et al., 2022).
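Equations (16) through (18) can be chained as a single integration step. The sketch below uses Rodrigues' formula for the matrix exponential of (16); the IMU readings and time step are toy values of our choosing:

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix Omega of an angular-velocity vector w."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(Omega_dt):
    """Matrix exponential of a skew-symmetric matrix via Rodrigues' formula (Eq. (16))."""
    theta = np.linalg.norm([Omega_dt[2, 1], Omega_dt[0, 2], Omega_dt[1, 0]])
    if theta < 1e-12:
        return np.eye(3)
    K = Omega_dt / theta
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

dt = 0.01
omega = np.array([0.0, 0.0, 0.5])   # gyro reading: 0.5 rad/s about z
a = np.array([0.1, 0.0, 0.0])       # accelerometer reading in the body frame
R = np.eye(3)                       # initial orientation R(t0)

dR = so3_exp(skew(omega) * dt)          # Eq. (16): incremental rotation
dv = R @ a * dt                         # Eq. (17): velocity change
dp = dv * dt + 0.5 * (R @ a) * dt**2    # Eq. (18): position change
R_new = R @ dR                          # composed orientation
```

Repeating this step per IMU sample is the basic propagation loop that the fused LiDAR/camera corrections then refine.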

3.2. Environment

This section focuses on the application of semantic SLAM in both dynamic indoor and outdoor environments, highlighting the unique challenges and solutions for each context. In dynamic indoor environments, semantic SLAM must account for rapidly changing objects and obstacles, while outdoor environments introduce challenges such as varying terrain, weather conditions, and large-scale dynamic elements. By capturing semantic information, robots can achieve more robust localization and mapping in these complex and unpredictable settings.

3.2.1. Semantic SLAM for dynamic indoor environment

Visual semantic SLAM (VSLAM) applied in dynamic indoor environments seeks to enhance performance in indoor scenarios, where traditional methods usually fail in the presence of moving objects (Habibpour et al., 2024, Zhao et al., 2021). ATY-SLAM, for example, is a lightweight yet effective framework that integrates several parallel processes to improve the accuracy and robustness of the semantic SLAM system. First, the YOLOv7-tiny algorithm is applied to detect dynamic objects in the scene. Then, motion consistency detection and the Lucas–Kanade optical flow algorithm are employed to classify feature points as either static or dynamic. This approach aligns with recent developments in fast, semantic-aware motion detection. The fusion of depth information, feature flow, and semantic cues through probabilistic frameworks has proven effective for real-time dynamic object filtering (Singh, Wu, Do, & Lam, 2022). The dynamic points are filtered out, leaving only static points for mapping, which is crucial for maintaining the SLAM system’s precision in environments with changing indoor scenes. In addition, ATY-SLAM incorporates an adaptive thresholding scheme for keyframe selection, further refining the process by considering environmental dynamics. These parallel processes enable ATY-SLAM to effectively address the challenges resulting from dynamic environments, overcoming the limitations of traditional SLAM systems that assume static settings (Deng et al., 2025, Qi et al., 2023).
Delving deeper, the system initially incorporates the YOLOv7-tiny object detection model to recognize dynamic objects present in the scene. The resulting bounding boxes indicate areas where dynamic feature points are likely to appear. Removing the points inside these regions refines the feature set to a more stable, quasi-static subset based on the model’s predictions. Specifically, all feature points detected in a frame \(F_k\) can be expressed using (19): \( F_k = \{ f_1, f_2, f_3, \ldots, f_n \} \), where dynamic points identified by YOLOv7-tiny are excluded from \(F_k\), resulting in a refined set of static points denoted as \(P_k\).
The method uses epipolar geometry to enhance motion consistency detection. The distance \(d\) of a feature point from the epipolar line determines whether the point is dynamic or static, and is given by (20): \( d = \frac{\left| P_2^{\top} F P_1 \right|}{\sqrt{X^2 + Y^2}} \), where \(P_1\) and \(P_2\) represent the corresponding feature points across frames, \(F\) denotes the fundamental matrix, and \(X\) and \(Y\) are the first two components of the epipolar line \(F P_1\). Feature points whose distance exceeds a threshold \(\epsilon_{th}\) are classified as dynamic. Optical flow is employed to estimate the motion vector \((u, v)\) of each feature point, based on the brightness constancy assumption, via (21): \( \begin{bmatrix} I_x & I_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -I_t \), where \(I_x\), \(I_y\), and \(I_t\) represent the image gradients along the x-axis and y-axis and the temporal derivative at the feature point, respectively.
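Equation (20) translates into a simple filter over feature matches. The fundamental matrix below models a pure horizontal camera shift and, like the matches and threshold, is an illustrative toy of ours rather than ATY-SLAM's actual geometry:

```python
import numpy as np

def epipolar_distance(F, p1, p2):
    """Distance of p2 from the epipolar line of p1 (Eq. (20)); points are homogeneous."""
    line = F @ p1                                      # epipolar line (X, Y, Z) in image 2
    return abs(p2 @ line) / np.sqrt(line[0]**2 + line[1]**2)

def filter_static(F, matches, eps_th=1.0):
    """Keep only matches whose epipolar distance stays below the threshold eps_th."""
    return [(p1, p2) for p1, p2 in matches
            if epipolar_distance(F, p1, p2) < eps_th]

# Toy fundamental matrix for a horizontal camera shift: epipolar lines are image rows.
F = np.array([[0.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])
matches = [
    (np.array([10.0, 20.0, 1.0]), np.array([14.0, 20.0, 1.0])),  # static: stays on its row
    (np.array([30.0, 40.0, 1.0]), np.array([31.0, 47.0, 1.0])),  # dynamic: moved off its line
]
static = filter_static(F, matches)   # only the first match survives
```

The surviving static matches are what the tracking thread would then use for pose estimation.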
Additionally, an adaptive thresholding method is used to improve keyframe selection, which is paramount for accurate mapping and localization. This method works by analyzing changes in model observations and assessing the angles and distances of matching points across frames. This adaptive threshold, Tadaptive, is carefully fine-tuned based on these ratios of matching points, angular deviations, and other factors, ensuring that only the most reliable frames are selected as keyframes (Qi et al., 2023).

3.2.2. Semantic SLAM for dynamic outdoor environment

The presence of dynamic objects in outdoor environments poses significant challenges for visual SLAM systems, requiring sophisticated approaches to maintain accurate localization and mapping. Early solutions, like DS-SLAM, integrate semantic segmentation with dynamic object handling using five parallel threads: tracking, semantic segmentation, local mapping, loop closing, and dense semantic map creation. By combining semantic segmentation networks with a method for motion consistency checking, DS-SLAM detects and discards moving dynamic objects, such as people, present in the scene. This methodology greatly enhances localization accuracy by reducing the effect of dynamic elements. A key component of DS-SLAM is the semantic octo-tree map, a dense mapping technique that incorporates semantic labels. These labels embed a deeper understanding of the environment, enabling higher-order tasks such as environment interaction and advanced path planning (Ran, Yuan, Zhang, Tang et al., 2021, Wu et al., 2024, Yu et al., 2018a). The evolution from DS-SLAM to systems like DE-SLAM represents a shift toward handling increasingly dynamic scenarios, with each iteration improving the robustness against moving objects (Xing, Zhu, & Dong, 2022). More recent frameworks have evolved to handle increasingly complex outdoor dynamic scenes through advanced deep learning models. These refined motion estimation algorithms achieve superior performance in challenging real-world conditions (Wen et al., 2023).
The robust computation of the fundamental matrix and the distance from the epipolar line are core algorithms in DS-SLAM, ensuring geometric consistency between frames and rejecting outliers caused by moving objects. The fundamental matrix \(F\) is used to compute the epipolar line \(I = F P\), where \(P\) represents the homogeneous coordinates of a point in the image. This calculation helps determine which points remain consistent across sequential frames. Additionally, the distance \( D = \frac{\left| P'^{\top} F P \right|}{\sqrt{X^2 + Y^2}} \) is used to assess point consistency, where \(P'\) is the matched point in the subsequent frame and \(X\) and \(Y\) are the first two components of the epipolar line \(F P\). Points with a distance exceeding a set threshold are classified as dynamically moving and are excluded from pose estimation.
In addition, DS-SLAM maintains a semantic octo-tree map and updates the occupancy grid with log-odds scores. The log-odds score is given by \( l = \log\frac{p}{1-p} \), where \(p\) represents the probability of occupancy; its inverse, \( p = \frac{e^{l}}{e^{l} + 1} \), recovers the probability from the accumulated log-odds. By accumulating evidence of occupancy through these log-odds scores, DS-SLAM ensures the map remains accurate and up-to-date, even in dynamic environments where objects may frequently move.
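The log-odds bookkeeping can be sketched in a few lines; the hit/miss likelihoods (0.7 and 0.4) are invented sensor-model values of ours, not DS-SLAM's parameters:

```python
import math

def prob_to_logodds(p):
    """l = log(p / (1 - p))"""
    return math.log(p / (1.0 - p))

def logodds_to_prob(l):
    """p = e^l / (e^l + 1), the inverse mapping."""
    return math.exp(l) / (math.exp(l) + 1.0)

# Accumulate occupancy evidence for one voxel by summing log-odds, which is
# equivalent to multiplying independent likelihood ratios.
l = 0.0                                  # prior: p = 0.5
for hit in [True, True, False, True]:    # three hits, one miss (toy observations)
    l += prob_to_logodds(0.7 if hit else 0.4)

p_occupied = logodds_to_prob(l)
```

Working in log-odds keeps the update a cheap addition per observation and avoids probabilities saturating at exactly 0 or 1, which is why octree-based maps use this representation.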
DynaSLAM extends ORB-SLAM2 by incorporating dynamic object detection through a combination of multi-view geometry and deep learning, while also reconstructing occluded areas of the scene using background inpainting. This capability is particularly important in applications requiring long-term autonomy and continuous learning of the environment. DynaSLAM runs through a structured, multi-threaded approach, where tasks such as tracking, semantic segmentation, local mapping, loop closure, and semantic map creation run in parallel. Real-time semantic segmentation is core to this system since it facilitates the detection of dynamic objects that could hamper the tracking and mapping processes. These advancements lead to reduced drift, more reliable trajectory estimation, high accuracy, and improved fidelity in mapping, particularly under challenging conditions like dynamic environments or varying lighting. Alternative approaches to dynamic SLAM have focused on feature weighting strategies, where detected semantic and geometric properties are used to assign reliability scores to features rather than completely removing them, offering a computationally lighter solution for real-time applications (Zhong, Hu, Huang, Bai, & Li, 2022). On the other hand, DS-SLAM takes a more comprehensive approach to using semantic data, integrating it more deeply into the system’s operations (Bescos et al., 2018, Li, Wang et al., 2021, Xie et al., 2021, Yang and Cai, 2024). Recent advancements in dynamic semantic SLAM in both indoor and outdoor environments are summarized in Table 2.

Table 2. Dynamic semantic SLAM method comparison.

Reference | Dynamic object detection | Environment suitability | Strengths | Weaknesses
Gupta et al., 2015, Qi et al., 2025, Xu et al., 2019 and Yang, Ran, Wang, Lu, and Chen (2022) | Combines Mask R-CNN instance segmentation with residual-based motion filtering | Indoor | Dense object-level maps; tracks moving objects; reconstructs background | Low frame rate; COCO-dependent; needs GPU; not for large/outdoor scenes
Bescos, Campos, Tardós and Neira (2021) and Ying et al. (2023) | Classifies ORB features as static/dynamic; combines geometry and semantic cues | Outdoor | Real-time object-aware SLAM; full 6-DoF for camera/objects; good for driving | Sparse maps; ignores segmentation delay; needs stereo/RGB-D; accuracy depends on segmentation
Ge, Zhang, Wang, Coleman, and Kerr (2023) and Gonzalez, Marchand, Kacete, and Royan (2022) | Groups points by semantic class; motion modeled with mechanical constraints | Outdoor | Robust tracking for objects (e.g., cars); accurate camera pose with dynamic objects | No dense mapping; needs accurate segmentation/joint models; not for cluttered/indoor settings
Wang, Wu, Li and Yu (2024) and Judd and Gammell (2024) | Scene flow clustering and multilabel RANSAC; no semantics required | Indoor and Outdoor | Unsupervised; tracks multiple motions without semantics; occlusion-tolerant | High computation; not real-time; sparse maps; lacks object-level detail; needs stereo/depth sensors

3.2.3. Approaches to 3D object representation

A key aspect of Semantic SLAM is how the environment and objects within it are represented in 3D, since the chosen representation directly influences the system’s ability to perform accurate localization, mapping, and semantic understanding. Hence this section explains the approaches to 3D object representation in both indoor and outdoor environments. These object representations are broadly classified into two categories: Euclidean structured data and non-Euclidean structured data. Euclidean structured data encodes geometric information about the object’s shape, size, and spatial relationships using Euclidean space coordinates. On the other hand, non-Euclidean structured data offers a more flexible and expressive way to encode geometric information, especially for objects with curvature, irregularities, or topological complexity.
While point clouds and meshes can be considered as both Euclidean and non-Euclidean data depending on the scale of observation, we categorize them as non-Euclidean due to their often infinite curvature, self-intersections, and variable dimensions. Analyzing such data on a broader scale helps understand the overall features of 3D objects, which is useful for tasks such as object recognition and correspondence. Fig. 17 represents the different classifications of object representations, with a detailed description of each category provided in Table 3. Once the basic principles are established, it becomes important to look at the datasets that drive this field, since they form the basis for training, validation, and comparison of Semantic SLAM methods.

Fig. 17. A detailed illustration of different types of 3D object representations commonly used in semantic SLAM.

Table 3. 3D object representations for semantic SLAM in scene understanding.

Reference | Data type | Characteristics | Advantages | Challenges
Chen, Shao et al., 2022, Choudhary et al., 2017, Li, Zhou et al., 2024, Nie et al., 2020, Peng, Zhao et al., 2024, Wen et al., 2021, Xie et al., 2022, Xu et al., 2020, Yang et al., 2020 and Zhu, Xiao and Fan (2025) | Descriptors | Describe geometric or topological characteristics; capture shape, surface, and texture information | Object recognition; shape similarity; efficient 3D processing | Deformable shape handling; large-scale scalability
Gong et al., 2021, Huang et al., 2023, Jung et al., 2025, Liu, Mi et al., 2021, Sandstrom et al., 2023, Wang, Tian et al., 2025, Yang, Chen et al., 2023, You et al., 2022 and Ying and Li (2023) | Projections | Convert 3D objects into 2D grids | Retains key shape characteristics | Information loss in dense tasks
Choi et al., 2015, Jin et al., 2020, Mascaro et al., 2022, Popovic et al., 2021, Rosinol et al., 2023, Rosu et al., 2020, Wang, Tian, Liu, 2025 and Yan, Wang, He, Chang, and Zhuang (2020) | Volumetric (voxel/octree) | Grid-based 3D space modeling | Simple, structured encoding | High memory cost; poor resolution scalability
Cheng et al., 2023, Cheng et al., 2021, Dang et al., 2019, Deng et al., 2020, Kuang et al., 2022, Muthu et al., 2020, Yan et al., 2022 and Zhang, Zhang, Jin and Yi (2022) | RGBD | Combines color and depth info (2.5D) | Cost-effective, accurate pose and scene understanding | Struggles with noisy/incomplete data
An et al., 2022, He et al., 2024, Huang et al., 2024, Islam et al., 2024, Shi, Zha et al., 2020, Zheng et al., 2025 and Yang, Ye, Zhang, Wang, and Qiu (2024) | Multi-view geometry | Combine multiple 2D images for 3D reconstruction | Reduces noise and occlusion; tolerant to lighting issues | Sensitive to calibration errors; not ideal for dynamic scenes
Bescos, Cadena et al., 2021, Kong et al., 2023, Li, Guo et al., 2025 and Ruan, Zang, Zhang, and Huang (2023) | Neural field | MLPs represent object surfaces | Compact, watertight, coherent representation | Complex temporal modeling; requires large datasets
Han and Yang, 2023, Peng, Xu et al., 2024, Tian et al., 2024, Tschopp et al., 2021 and Wei and Wang (2018) | Super quadrics (SQ) | Compact 3D shape abstraction from point clouds | Efficient representation with shape fidelity | Training requires large datasets; sensitive to temporal variance
Cho et al., 2020, Isele et al., 2021, Li, Fu et al., 2024, Li et al., 2022, Pan et al., 2024, Vishnyakov et al., 2021 and Zhang, Huo, Huang, and Liu (2025) | Point cloud | Unstructured 3D points without topology | Flexible and detailed geometry | Hard to model globally; calibration sensitivity
Arshad and Kim, 2024, Duan et al., 2022, Fernandez-Cortizas et al., 2024, Liu, Yuan et al., 2024, Qian et al., 2022 and Zhang, Zhang, Liu, Naixue Xiong and Li (2024) | Graphs | Nodes as vertices; edges encode relationships | Scalable and expressive for both local/global tasks | High complexity; hard to visualize large graphs
Herb et al., 2021, Rosu et al., 2020 and Wang, Zhang and Li (2020) | Meshes | Polygons and vertices define surface geometry | Preserves structure for segmentation and matching | Irregular structure hampers DL integration; sensitive to resolution and noise

4. Datasets

Datasets play a central role in enabling research progress. Here, we present the most widely used datasets for evaluating both traditional and semantic SLAM systems, highlighting their characteristics and the role they play in training, benchmarking, and validating algorithms. These datasets encompass real-world captures and simulated environments, supporting various sensor modalities including visual, RGB-D, and LiDAR systems. They span diverse scenarios, from static indoor scenes to dynamic outdoor environments, enabling comprehensive benchmarking of SLAM algorithms. The following subsections provide concise descriptions of key datasets and their characteristics.

4.1. TUM RGB-D dataset

The TUM RGB-D dataset (Lin, Zhang et al., 2025, Sturm et al., 2012) is a popular benchmark for evaluating SLAM and visual odometry systems. It provides RGB-D images captured using a Microsoft Kinect sensor, along with ground truth poses obtained through a motion capture system. The dataset encompasses various indoor environments, including both static and dynamic scenes, offering a diverse and challenging test bed for visual SLAM algorithms. In this paper, the dataset sequences are used to evaluate the results discussed in Sections 8 and 8.4, with the list of sequences detailed in Table 4.
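Evaluations on such sequences are typically reported as absolute trajectory error (ATE). The following is a minimal sketch, assuming the estimated and ground-truth trajectories are already time-associated and expressed in the same frame; the standard TUM tooling additionally performs a least-squares rigid alignment before computing the error.

```python
import math

def ate_rmse(gt, est):
    """Root-mean-square of per-pose translational errors between two
    equally indexed lists of (x, y, z) positions."""
    assert len(gt) == len(est)
    sq = [sum((g - e) ** 2 for g, e in zip(p, q)) for p, q in zip(gt, est)]
    return math.sqrt(sum(sq) / len(sq))

# Toy trajectories: the estimate wobbles 0.1 m around a straight ground truth.
gt  = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est = [(0.0, 0.1, 0.0), (1.0, -0.1, 0.0), (2.0, 0.1, 0.0)]
print(round(ate_rmse(gt, est), 3))  # prints 0.1
```

ATE summarizes global consistency of the whole trajectory; relative pose error (RPE), by contrast, measures local drift over fixed time intervals and is less sensitive to where along the trajectory an error occurs.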

Table 4. TUM RGB-D dataset sequences.

Sequence | Description | Image size | Frame rate
fr3_walking_xyz | Walking sequence with significant translational motion in x, y, z directions | 640 × 480 pixels | 30 Hz
fr3_walking_static | Static scene with minimal motion | 640 × 480 pixels | 30 Hz
fr3_walking_rpy | Walking sequence with rotational motion in roll, pitch, yaw | 640 × 480 pixels | 30 Hz
fr3_walking_half | Half walking sequence with moderate motion | 640 × 480 pixels | 30 Hz

4.2. KITTI dataset

The KITTI dataset (Geiger, Lenz, & Urtasun, 2012) is a widely recognized benchmark for evaluating computer vision and SLAM algorithms, particularly in the context of autonomous driving. It contains high-resolution stereo and LiDAR data acquired from a vehicle navigating through various environments, such as urban, rural, and highway settings. The dataset provides ground truth poses obtained from a GPS/IMU system, enabling precise evaluation of SLAM performance in real-world outdoor environments. The KITTI-360 dataset extends this benchmark with longer sequences, providing more comprehensive evaluation capabilities for semantic SLAM and scene understanding tasks in complex urban environments (Liao, Xie, & Geiger, 2023). The KITTI dataset is used to replicate the results discussed in Sections 8 and 8.4, with the specific sequences outlined in Table 5.

Table 5. KITTI dataset sequences.

Sequence | Description | Image size | Frame rate
KITTI 00 | Urban environment with moderate traffic | 1242 × 375 pixels | 10 Hz
KITTI 01 | Highway environment with high-speed motion | 1242 × 375 pixels | 10 Hz
KITTI 02 | Urban environment with dynamic objects | 1242 × 375 pixels | 10 Hz
KITTI 03 | Rural environment with varying terrains | 1242 × 375 pixels | 10 Hz
KITTI 04 | Urban environment with sharp turns and occlusions | 1242 × 375 pixels | 10 Hz

4.3. BONN dataset

The Bonn RGB-D Dynamic dataset (Palazzolo, Behley, Lottes, Giguère, & Stachniss, 2019a) is a key resource for advancing research in RGB-D SLAM. It consists of 24 dynamic sequences and 2 static sequences, capturing activities such as box manipulation and balloon play, designed to test SLAM algorithms in realistic, dynamic environments. Each sequence is accompanied by ground truth sensor poses obtained from an Optitrack Prime 13 motion capture system, along with a 3D point cloud of the static environment, recorded using a Leica BLK360 terrestrial laser scanner. The dataset is formatted similarly to the TUM RGB-D dataset, facilitating compatibility with existing evaluation tools.

4.4. A1 and Jackal

These publicly available datasets were part of the Kimera-Multi project (Tian, Chang et al., 2023, Wan and Luo, 2025). They were gathered using Unitree A1 quadrupedal robots and Clearpath Robotics Jackal wheeled robots, both equipped with RealSense D455 RGB-D cameras, IMUs, and Velodyne 3D LiDAR. The sequences contain RGB images, depth images, compressed grayscale images, wheel odometry, and LiDAR point clouds. The data was recorded in various locations across MIT’s campus, including indoor and outdoor areas, underground tunnels, and an undergraduate dormitory. The diverse environments and revisited locations in these datasets make them particularly valuable for evaluating loop closure detection algorithms, where semantic features combined with traditional bag-of-words approaches have shown significant improvements in recognition accuracy (Sun, Wang, Ni, & Li, 2024).

4.5. uHumans2

This dataset was created using the Unity simulator, where humans are simulated as realistic 3D models with standard graphic assets. It is used for 2D segmentation tasks and was developed as part of the Kimera project. For benchmarking purposes, the simulator provides ground truth poses for both humans and objects. The dataset contains visual–inertial data for various scenes, both with and without dynamic objects, covering environments such as offices, apartments, subways, and neighborhoods (Rosinol et al., 2021).

4.6. CarSim

These datasets consist of simulated urban outdoor scenes within the TESSE environment, designed similarly to the uHumans dataset. The simulated car is equipped with four monocular cameras positioned at the front, rear, left, and right. Additionally, the dataset includes ground truth pixel-wise semantic labels for precise analysis (Abate et al., 2024, Zhi et al., 2024). This multi-camera setup is particularly relevant for autonomous parking research, where similar sensor configurations have been successfully deployed in both outdoor valet parking scenarios (Abate et al., 2023) and indoor parking environments with semantic object detection capabilities (Shao, Zhang, Zhang, Shen, & Zhou, 2022).

4.7. openLORIS

This dataset is built for lifelong SLAM of service robots. The data is collected using two cameras: the RealSense D435i, which provides RGB-D images and IMU measurements, and the RealSense T265 tracking module, which captures stereo fisheye images and IMU measurements. The dataset includes five scenes, each containing 2 to 7 sequences taken at different times. Ground truth robot poses for each scene are provided by the Optitrack MCS and Hokuyo LiDAR systems (Shi, Li et al., 2020).

4.8. BeVIS (Indoor parking dataset)

The BeVIS dataset specifically targets the challenging domain of indoor parking environments. This comprehensive benchmark provides ground-truth trajectories for evaluating SLAM systems in parking structures, where GPS-denied conditions and repetitive structural patterns pose unique challenges. The dataset supports the evaluation of tightly-coupled semantic SLAM frameworks that integrate front-view cameras, inertial sensors, and surround-view systems for robust localization and detection of semantic objects in parking scenarios (Shao et al., 2023).

4.9. Scenesv2

The Scenes dataset v2 includes RGB and depth images from 14 scenes featuring various furniture items such as chairs, coffee tables, sofas, and tables, along with a selection of objects from the RGB-D Object dataset, including bowls, caps, cereal boxes, coffee mugs, and soda cans. Each scene has ground truth annotations and is represented as point cloud data, generated by aligning multiple video frames using Patch Volumes Mapping (Lai et al., 2014, Shao et al., 2025).

4.10. Freiburg cars

This dataset comprises RGB video sequences of 52 cars, captured with a camcorder in a full 360° rotation. Each video contains approximately 1500 to 3500 frames, which are uniformly downsampled to around 120 frames to accelerate the 3D reconstruction process (Sedaghat and Brox, 2015, Shao et al., 2025).

4.11. Redwood-OS chairs

This dataset features a large and diverse collection of RGB-D and reconstructed models, ranging from shoes, mugs, and toys to grand pianos, construction vehicles, and large outdoor sculptures. The data was captured using PrimeSense Carmine cameras with a resolution of 640 × 480 pixels and a frame rate of 30 Hz. Each scan consists of both color and depth images, with pixel values representing depth in millimeters (Choi, Zhou, Miller, & Koltun, 2016).
Further, to support both researchers and practitioners, we provide a summary of widely used open-source frameworks that extend traditional SLAM with semantic capabilities. Table 6 highlights prominent tools, their key features, and associated publications, offering practical resources for replicating results and advancing research in semantic SLAM.
The evolution from traditional SLAM datasets to semantically-annotated ones reflects the field’s progression toward scene understanding. Modern datasets not only provide geometric ground truth but also semantic labels, instance segmentation, and dynamic object annotations. These diverse datasets are often used for benchmarking algorithms against various challenges, though the need for more specialized datasets targeting specific applications continues to grow. Finally, understanding these strengths and limitations of available datasets naturally leads us to explore how these resources have supported recent advances in Semantic SLAM, especially in enhancing scene understanding.

Table 6. Prominent open-source frameworks for semantic SLAM and their key contributions.

Framework | Key features | Applications | Reference
ORB-SLAM3 (with Semantic Extensions) | Multi-camera, stereo, and inertial SLAM; semantic object integration via Mask R-CNN | Robust semantic SLAM across diverse environments | Campos, Elvira, Rodríguez, Montiel, and Tardós (2020)
Kimera | Real-time metric-semantic mapping; 3D scene graphs; integrates visual–inertial odometry | Robot navigation, semantic scene understanding | Rosinol, Abate, Chang and Carlone (2019)
DROID-SLAM | End-to-end deep learning-based dense SLAM; robust to dynamics; lightweight | Visual odometry, dynamic scene tracking | Teed and Deng (2021)
SemanticFusion | Combines CNN-based semantic segmentation with ElasticFusion for dense maps | Indoor semantic mapping | McCormac, Handa, Davison, and Leutenegger (2016)
MaskFusion | Object-aware SLAM; fuses instance segmentation with 3D reconstruction | Augmented reality, dynamic object mapping | Rünz and Agapito (2018)
Co-Fusion | Multi-object segmentation and tracking in real-time; extends ElasticFusion | Dynamic SLAM with moving objects | Rünz and Agapito (2017)
Semantic voxblox | Incremental volumetric mapping with semantic fusion | Long-term mapping, mobile robotics | Palazzolo, Behley, Lottes, Giguère, and Stachniss (2019b)
PanopticFusion | Panoptic segmentation integrated into dense SLAM pipeline | Scene understanding, semantic mapping | Narita, Seno, Ishikawa, and Kaji (2019)
DS-SLAM | Dynamic semantic SLAM using deep learning for segmentation and static/dynamic separation | Robust localization in dynamic scenes | Yu et al. (2018b)
OpenVSLAM (with semantics) | Versatile, modular SLAM with support for multiple camera models; extensible with semantics | Lightweight robotics, reproducible experiments | Sumikura, Shibuya, and Sakurada (2019)
MonoScene-SLAM (emerging) | Combines monocular SLAM with 3D scene completion and semantic priors | 3D reconstruction from monocular cameras | Cao and de Charette (2021)

5. Advancements in semantic SLAM for scene understanding

Traditional SLAM techniques typically rely on geometric and probabilistic approaches, utilizing methods, such as feature-based tracking and EKFs, to estimate a robot’s pose and map its environment. While these approaches are effective, they often face challenges in complex and dynamic environments due to limitations in feature extraction and sensitivity to sensor noise. Additionally, these approaches primarily rely on geometric features and lack a deeper semantic understanding of the environment. In contrast, semantic SLAM techniques integrate semantic information, such as object categories and semantic segmentation, into the mapping and localization process. By incorporating this layer of semantic understanding, semantic SLAM algorithms can generate more meaningful maps that not only represent spatial layouts but also provide insights into the environment’s semantic content. This enables robots to make more informed decisions and interact more intelligently with their surroundings, opening up new possibilities for applications in areas like autonomous driving, robotics, and augmented reality.
Recent advances in semantic SLAM have focused on addressing the challenges of dynamic indoor and outdoor environments, which represent the most common and challenging real-world scenarios for autonomous systems. These environments, characterized by frequent changes, moving objects, and diverse sensor conditions, provide ideal testbeds for evaluating the robustness and adaptability of different semantic SLAM methods. In this section, we present the recent advancements in Semantic SLAM techniques, with a particular focus on indoor and outdoor scene understanding. We emphasize how different methods incorporate semantic information to improve mapping and localization, reflecting the current trends and innovations in the field.

5.1. Key approaches in indoor scene understanding

The quality of the global map is important for accurate localization. To address this, Fan et al. proposed a novel semantic SLAM method that builds an accurate point cloud map while generating bounding boxes and masks using BlitzNet. The approach enables the creation of depth-stable points by accurately matching features in dynamic regions (Fan et al., 2020). Similarly, Han et al. provided a detailed review of indoor semantic mapping, covering aspects such as spatial mapping, semantic information acquisition, and map representation (Han, Li, Wang, & Zhou, 2021a). Chen et al. presented an extensive survey of semantic SLAM, detailing recent developments and analyzing the extraction and processing of semantic information using state-of-the-art datasets (Chen, Xiao et al., 2025).
Zhu et al. proposed a dense SLAM system called NICE-SLAM, which creates a hierarchical scene representation using local information. This representation, optimized with pre-trained geometric priors, enables detailed reconstruction of large indoor scenes while being more scalable, efficient, and robust (Zhu et al., 2022). Similarly, Wei et al. introduced DO-SLAM, a novel SLAM algorithm built upon ORB-SLAM2 and designed to enhance localization accuracy and system robustness in dynamic environments. By introducing outlier detection, this approach aims to mitigate the impact of dynamic objects on SLAM performance, improving both accuracy and reliability in challenging scenarios (Wei, Zhou, Duan, Liu and An, 2023). Additionally, Yu et al. presented DS-SLAM, which also extends ORB SLAM2 for highly dynamic environments. Their method uses five threads (tracking, semantic segmentation, local mapping, loop closing, and dense semantic map creation) to improve localization accuracy (Yu et al., 2018a). In a similar vein, Eslamian et al. proposed Det-SLAM, based on ORB SLAM3 and Detectron2, which identifies and removes dynamic points to achieve semantic SLAM in dynamic environments (Eslamian & Ahmadzadeh, 2022). Xu et al. further advanced this line of research with HMC-SLAM, a robust RGB-D SLAM system that leverages hierarchical multidimensional clustering to detect and filter dynamic features, significantly enhancing pose estimation in highly dynamic scenes (Xu, Zheng, Pan, & Yu, 2025). Moreover, Kim et al. developed SimVODIS, a unified framework that simultaneously performs visual odometry, object detection, and instance segmentation in a single self-supervised architecture that enables both geometric and semantic understanding for downstream SLAM or perception tasks in complex environments (Kim, Kim, & Kim, 2022).
These semantic-based approaches have evolved to include adaptive fusion mechanisms that assign dynamic probabilities to detected objects, allowing the system to intelligently adjust its reliance on semantic versus geometric information based on scene complexity (Jiao, Wang, Li, Deng, & Xu, 2022). The computational efficiency of semantic integration can be further improved through selective frame processing. For example, the semantic segmentation can be applied only to keyframes rather than every frame, achieving real-time performance without sacrificing localization accuracy (Lee, Back, Hwang, & Chun, 2023b). Lightweight visual odometry systems have taken efficiency further by integrating adaptive geometric-semantic feature processing that dynamically balances computational load based on scene dynamics. This approach enables robust performance even on resource-constrained platforms (Wei, Huang, Liu and Zhou, 2023). Beyond frame-level optimizations, deep learning approaches have also improved loop closure detection efficiency through weighted triplet loss functions that learn discriminative features for place recognition, reducing the computational burden of exhaustive frame matching (Dong et al., 2022).
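The keyframe-only segmentation idea above can be sketched as follows; the fixed-interval keyframe policy and all function names are illustrative stand-ins, since real systems promote keyframes using parallax and covisibility criteria rather than a fixed stride.

```python
def is_keyframe(frame_idx: int, every_n: int = 5) -> bool:
    """Toy keyframe policy: promote every n-th frame."""
    return frame_idx % every_n == 0

def segment(frame_idx: int) -> str:
    """Placeholder for an expensive segmentation-network forward pass."""
    return f"mask@{frame_idx}"

def process(n_frames: int, every_n: int = 5):
    """Run segmentation only on keyframes; intermediate frames reuse
    the most recent keyframe's mask."""
    masks, seg_calls, last = [], 0, None
    for i in range(n_frames):
        if is_keyframe(i, every_n) or last is None:
            last = segment(i)
            seg_calls += 1
        masks.append(last)
    return masks, seg_calls

masks, calls = process(20)
print(calls)  # the network ran on 4 of 20 frames
```

The trade-off is latency of the semantic masks on intermediate frames: a mask propagated from the last keyframe can lag a fast-moving object, which is why some systems combine this scheduling with geometric motion checks.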
Recent advancements in semantic SLAM for indoor environments, as highlighted by key research papers, underscore the integration of deep learning techniques like BlitzNet and Detectron2 to enhance semantic understanding and dynamic object detection within SLAM systems. Despite these strides, future research should prioritize further enhancing the robustness of SLAM systems in highly dynamic environments, optimizing them for real-time performance, and advancing the integration of multi-sensor data for improved accuracy and efficiency.

5.2. Key approaches in outdoor scene understanding

Recent approaches to outdoor semantic SLAM have emphasized robust handling of dynamic elements and environmental variability. Dynamic SLAM exemplifies this trend by integrating SegNet-based semantic segmentation with ORB-SLAM2, using spatial motion information to achieve a 39.5% accuracy improvement in challenging outdoor conditions (Wen et al., 2023). Lin et al. developed the DPL SLAM technique, which combines ORB SLAM3 with a line segment detection network for efficient pose estimation. They also incorporated CUDA-enabled YOLOv5 for object detection to extract semantic information and remove abnormal features. The authors claim that this novel algorithm excels by not relying on a single source of information, effectively handling both known and unknown dynamic objects (Lin, Zhang, Tian, Yu, & Lan, 2024). Similarly, Zhang et al. proposed a semantic-based visual SLAM technique using ORB SLAM3 with TensorRT-optimized YOLOX to detect humans and non-humans in both indoor and outdoor environments (Zhang & Li, 2023). RSO-SLAM by Qin et al. integrates instance segmentation and optical flow to enhance robustness and localization accuracy in dynamic scenarios. Using a “KMC kmeans + connectivity” algorithm and ORB SLAM2, it detects motion regions and effectively handles non-rigid objects and slow-moving targets. However, the system struggles when large moving objects dominate the field of view or when significant changes occur (Qin et al., 2024).
Li et al. proposed a VSLAM method based on ORB SLAM2 integrated with Deeplab v3+ that incorporates semantic information to eliminate the negative effects of dynamic objects on precise localization (Li, Song et al., 2023). Ai et al. developed a new stereo SLAM system that combines ORB SLAM2 with the deep learning model ENet to enhance the performance of camera pose and trajectory estimation. The authors claim that this system is robust and practical, particularly in highly dynamic and complex urban environments (Ai et al., 2023b). Similarly, Esparza et al. proposed a stereo SLAM approach for both indoor and outdoor dynamic environments, using ORB SLAM2 with a neural network-based semantic segmentation and geometrical constraints to effectively eliminate dynamic objects (Esparza & Flores, 2022).
The synthesis of findings from papers on outdoor scene understanding highlights a notable shift towards integrating advanced semantic segmentation and deep learning techniques with traditional SLAM frameworks like ORB SLAM2 and ORB SLAM3. By incorporating methods such as Deeplab v3+, YOLOX, and ENet, these studies demonstrate significant enhancements in accurately localizing and mapping dynamic outdoor environments. This integration enables SLAM systems to effectively discern between stationary and moving objects, thus improving robustness in complex urban landscapes and varied outdoor conditions. However, challenges persist in scenarios involving large moving objects and rapid environmental changes, highlighting the ongoing need for further research. Overall, the fusion of semantic information with SLAM technologies promises advancements in autonomous navigation and spatial understanding, crucial for applications ranging from robotics to augmented reality in outdoor environments. A timeline diagram illustrating the most commonly implemented semantic SLAM systems for both indoor and outdoor scenes is shown in Fig. 18.
The figure illustrates the evolution of techniques employed for constructing semantic maps and enhancing scene understanding from 2017 to 2024. Notably, the adoption of event cameras in semantic SLAM has surged in popularity from 2019 to 2024, indicating a promising avenue for future research. This trend suggests that event cameras hold significant potential for extracting semantic features in highly dynamic scenarios, paving the way for further advancements in the field. Furthermore, Table 7 provides a benchmark comparison of different methodologies and sensors used in semantic SLAM under various scenarios. The comparison highlights the evolution of semantic SLAM methods from traditional ORB-SLAM extensions to more advanced approaches based on semantic graphs, Gaussian splatting, and vision–language models. Earlier methods such as Blitz-SLAM, RDS-SLAM, and RS-SLAM focus mainly on indoor environments and RGB-D sensors, offering reliable dynamic handling but limited applicability outdoors. In contrast, recent frameworks like Kimera2, Dynamic-SLAM, and SG-SLAM extend capabilities to outdoor and highly dynamic environments through multi-sensor integration (LiDAR, IMU, stereo). Emerging approaches such as OpenGS-SLAM, SGS-SLAM, and Hier-SLAM++ demonstrate the integration of foundation models and 3D Gaussian splatting, enabling more generalizable and semantically rich representations. Another important trend is the gradual increase in open-source availability (e.g., RDS-SLAM, Kimera2, SG-SLAM), which facilitates reproducibility and benchmarking. Overall, the table underscores a shift from geometry-centric pipelines toward semantically enriched, multimodal, and open frameworks designed to handle real-world complexities more effectively.

Fig. 18. Timeline diagram for the most commonly known semantic SLAM techniques.

5.3. Emerging trends in semantic SLAM

Recent research in semantic SLAM has introduced new techniques that significantly change how systems understand and build maps of the environment. One major development is the use of a method called 3D Gaussian Splatting (3DGS), which allows systems to create detailed and semantically rich 3D maps much faster and more accurately than before. For example, methods like SGS-SLAM (Li, Liu et al., 2024, Yang, Wang et al., 2025) and OpenGS-SLAM (Chen, Zhang et al., 2025, Guerrero-Font et al., 2021, Li, Gu et al., 2018, Yang, Gao et al., 2025) use 3DGS to represent scenes using small 3D blobs (called Gaussians) instead of traditional pixels or point clouds. This leads to sharper object boundaries, faster map updates, and more precise object detection in the environment. Unlike older implicit representations such as NeRF, which render slowly and tend to produce blurrier reconstructions, these new techniques are more efficient and suitable for real-time use.
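At the core of 3DGS rendering is front-to-back alpha compositing of the projected Gaussians, C = Σᵢ cᵢ·αᵢ·Πⱼ<ᵢ(1 − αⱼ). The following is a minimal single-channel, single-pixel sketch; the splat values are hypothetical, and real pipelines derive each alpha from the projected 2D Gaussian's covariance at the pixel.

```python
def composite(splats):
    """splats: list of (color, alpha) pairs sorted near-to-far.
    Blends each splat's color weighted by its alpha and the transmittance
    accumulated from all splats in front of it."""
    color, transmittance = 0.0, 1.0
    for c, a in splats:
        color += c * a * transmittance
        transmittance *= (1.0 - a)  # light remaining for splats behind
    return color

# A mostly opaque splat in front of a dimmer, semi-transparent one behind it.
print(round(composite([(1.0, 0.8), (0.5, 0.5)]), 3))  # prints 0.85
```

Because each Gaussian contributes through a closed-form, differentiable weight, both geometry and per-Gaussian semantic labels can be optimized directly from rendered views, which is what makes the representation attractive for semantic SLAM.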
In addition, Hier-SLAM++ (Li, Hao et al., 2025, Zhang, Guo et al., 2024) goes a step further by combining 3DGS with powerful AI models such as SAM and CLIP, which have been trained on large datasets. This combination helps the system recognize and label a wide range of objects, even in unfamiliar environments, and works with both RGB-D and simple monocular camera inputs. These innovations make semantic SLAM systems smarter, faster, and more adaptable to real-world challenges.
Another emerging direction is the fusion of vision–language models with SLAM systems. For example, FindAnything (Abdelnasser et al., 2016, Laina et al., 2025) introduces an open-vocabulary SLAM framework that supports natural language queries during mapping, marking a shift toward generalizable, interactive scene understanding.
Additionally, SG-SLAM (Chen, Li et al., 2025, Dube et al., 2020, Wang, Lu et al., 2025) introduces a semantic-graph-enhanced LiDAR SLAM system. Instead of relying on point-wise labels, SG-SLAM constructs a robust object-level semantic graph, enhancing re-localization, loop closure, and global map consistency. It achieves real-time performance across challenging LiDAR datasets such as KITTI, MulRAN, and Apollo, outperforming both geometry-based and semantic SLAM baselines.
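The object-level semantic graph idea can be illustrated with a toy sketch; the distance threshold and the class-pair histogram descriptor below are assumptions chosen for illustration, not SG-SLAM's actual formulation.

```python
from itertools import combinations
from collections import Counter
import math

def build_graph(objects, radius=5.0):
    """objects: list of (class_name, (x, y, z)) instances. Connects objects
    whose centroids lie within `radius` meters; returns the edge list as
    sorted class pairs."""
    edges = []
    for (ca, pa), (cb, pb) in combinations(objects, 2):
        if math.dist(pa, pb) <= radius:
            edges.append(tuple(sorted((ca, cb))))
    return edges

def descriptor(edges):
    """Histogram of unordered class-pair edges: a crude place signature
    that can be compared across visits for re-localization."""
    return Counter(edges)

place = [("car", (0, 0, 0)), ("tree", (3, 0, 0)), ("pole", (30, 0, 0))]
print(descriptor(build_graph(place)))
```

Because the signature is built from object identities and their spatial relations rather than raw points, it stays stable under viewpoint changes and moderate scene dynamics, which is what makes graph-level matching useful for loop closure.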
Collectively, these works demonstrate a move toward explicit, modular, and generalizable representations in semantic SLAM. They enable not only improved accuracy and robustness but also new capabilities such as zero-shot segmentation, open-set reasoning, and interactive manipulation, which are critical for next-generation applications in robotics, embodied AI, and autonomous systems. These developments illustrate how the field is progressing, but to truly assess their impact we must also consider the evaluation criteria used in semantic SLAM. Therefore, the following section focuses on the performance metrics applied in semantic SLAM research.

Table 7. Benchmark comparison of core techniques and characteristics of semantic visual SLAM systems.

| Reference | Method | Technique | Network | Sensors used | Public datasets | Indoor | Outdoor | Dynamic | Available |
| Fan, Zhang, Tang, Liu, and Han (2022) | Blitz SLAM | ORB SLAM2 | BlitzNet, ResNet50 | RGBD Camera | TUM RGBD | | | | |
| Lin et al. (2024) | DPL-SLAM | ORB SLAM3 | YOLOv5-s | Intel D435i | TUM RGB-D, KITTI | | | | |
| Lv et al. (2024a) | MOLO-SLAM | ORB SLAM2 | Mask-RCNN | LiDAR, Kinect, Realsense | TUM, KITTI | | | | |
| Qin et al. (2024) | RSO-SLAM | ORB SLAM2 | YOLOv5-seg, LiteFlowNet2 | ZED2i Stereo | TUM, BONN, KITTI | | | | |
| Zhao et al. (2022) | KSF-SLAM | ORB SLAM2 | SegNet | ZED stereo | TUM RGB-D, KITTI | | | | |
| Liu and Miura (2021b) | RDS-SLAM | ORB SLAM3 | | KinectV2 | TUM RGBD | | | | |
| Ran, Yuan, Zhang, Tang et al. (2021) and Xiong et al. (2023) | RS-SLAM | ORB SLAM2 | PSPNet | RGB-D | TUM | | | | |
| Abate et al., 2024, Zheng et al., 2024 and Zhang, Song et al. (2025) | Kimera2 | Pose Graph | 3D Dynamic Scene Graph | LiDAR, RGBD, IMU | A1, Jackal | | | | |
| Cheng et al. (2023) | SG-SLAM | ORB SLAM2 | NCNN | RGBD Camera | TUM, BONN | | | | |
| Wu et al. (2020) | EAO-SLAM | ORB SLAM2 | YOLOv3 | RGBD Camera | TUM, Scenes V2 | | | | |
| Li, Zou et al. (2023) and Luo, Rao, and Wu (2023) | FD-SLAM | ORB SLAM3 | Fast-SCNN, Deepfillv2 | RGBD Camera | TUM RGB-D | | | | |
| Wang, Runz et al. (2021) | DSP-SLAM | ORB SLAM2 | | LiDAR, Stereo, RGBD | KITTI3D, Redwood Chairs | | | | |
| Wen et al. (2023) | Dynamic SLAM | ORB SLAM2 | SegNet | RGBD, IMU, LiDAR | KITTI | | | | |
| Cao et al. (2022) and Esparza and Flores (2022) | STDyn-SLAM | ORB SLAM2 | SegNet + VGG16 | ZED | KITTI | | | | |
| Yang, Gao et al. (2025) | OpenGS-SLAM | GS + Semantic Voting | SAM1.0, MobileSAMv2 | RGBD Camera | Replica, TUM | | | | |
| Li, Liu et al. (2024) | SGS-SLAM | Semantic GS | CNN + Semantic Loss | RGBD Camera | ScanNet, TUM | | | | |
| Wang, Lu et al. (2025) | SG-SLAM | Semantic Graph | SegNet4D | LiDAR | KITTI, MulRAN | | | | |
| Li, Hao et al. (2025) | Hier-SLAM++ | Hier GS + Semantic Loss | CLIP, SAM | RGB-D, Mono | Replica, TUM | | | | |
| Laina et al. (2025) | FindAnything | VL Semantic SLAM | CLIP, DINO, SAM | RGB Camera | Replica | | | | |

6. Applications of semantic SLAM

Semantic SLAM has advanced well beyond theoretical development, with real-world deployments across domains such as intelligent industry, smart cities, healthcare, and agriculture. This section provides an application-oriented perspective. We highlight representative use cases, examine practical deployment challenges, and connect algorithmic choices to application needs.

6.1. Intelligent/precision agriculture

Simultaneous Localization and Mapping (SLAM) is a fundamental technology for autonomous robot navigation in unknown environments, providing a joint estimate of robot poses and the 3D location of landmarks (Qadri and Kantor, 2021, Tiozzo Fasiolo et al., 2023). Its applications in agriculture are expanding rapidly, driven by the need for precision agriculture, enhanced food security, and efficient resource management amidst global warming and a growing world population (Pak & Son, 2025b). SLAM enables robots to construct accurate 3D maps of agricultural fields, which is crucial for tasks such as plant phenotyping, crop counting, yield prediction, and intelligent irrigation.
However, agricultural environments present unique and significant challenges for traditional SLAM systems. These include highly unstructured and dynamic settings, varying illumination conditions (e.g., direct sunlight, shadows), lack of texture, repetitive patterns (e.g., rows of crops), wind-induced movement of plants, and the presence of dynamic objects like humans or other robots (Xie et al., 2024). To overcome these issues, robust agricultural SLAM systems increasingly rely on a combination of diverse sensor modalities and advanced data fusion techniques, particularly those incorporating semantic understanding.

6.1.1. Sensor modalities in agricultural SLAM

No single sensor is sufficient for robust SLAM in complex agricultural environments, necessitating a multi-sensor approach (Tiozzo Fasiolo et al., 2023). The most commonly employed and effective sensor modalities include:
LiDAR (Light Detection and Ranging): LiDAR is highly effective for acquiring 3D spatial information and reconstructing dense point clouds (Pak & Son, 2025b). It provides rapid access to precise surface information and is notably robust to varying external lighting conditions, a critical advantage in outdoor agricultural settings where cameras struggle. LiDAR’s high measurement range, often up to 100 m, makes it suitable for both closed environments like greenhouses and open fields such as vineyards and orchards. 3D LiDAR, in particular, offers a larger field of view and directly generates dense point clouds, making it highly suitable for comprehensive map reconstruction. For instance, 3D LiDAR has been utilized for intelligent irrigation by leveraging its water-absorbing property to define water point clouds and segment surface water areas, enabling path creation for UAVs. Challenges, however, include potential degradation in rainy conditions due to reflections from raindrops, which can lead to noisy data. It may also perform poorly in environments with long corridors or low plant heights.
Visual Sensors (RGB, Stereo, RGB-D Cameras): Cameras are cost-effective and provide rich environmental information, including fine details that LiDAR might miss. RGB-D cameras, which provide precise depth information through physical measurements, are used for target detection and image segmentation. However, cameras are generally not robust enough for unstructured outdoor environments. They are highly sensitive to illumination changes, requiring constant exposure adjustments that can interfere with visual feature tracking, which often assumes constant brightness. Other challenges include lack of texture, repetitive patterns, and environmental dynamics like wind-blown crops, which can cause traditional visual SLAM algorithms (e.g., ORB-SLAM2) to fail or lose track (Lv et al., 2024b). Depth cameras are also sensitive to sunlight and have limited measurable depth outdoors.
IMU (Inertial Measurement Units): IMUs provide essential information about the robot’s orientation (roll, pitch, yaw) and can fill temporal gaps between less frequent GNSS measurements due to their high update rates (Tiozzo Fasiolo et al., 2023). They are compact and have low power consumption, making them ideal for integration into robotic platforms. However, IMUs are prone to drift over time when used for dead reckoning, and their position estimates can become quickly inaccurate due to vibrations on rough terrain. For these reasons, IMUs are almost universally coupled with other sensors to provide robust orientation and mitigate drift.
GNSS (Global Navigation Satellite System)/RTK-GNSS: GNSS provides absolute position estimates, and with Real-Time Kinematic (RTK) corrections, it can achieve centimeter-level accuracy. This is vital for georeferencing collected data and ensuring global consistency of maps. However, its reliability is significantly reduced in densely vegetated areas due to signal blockage and multi-path reflection, a common issue in agricultural settings. RTK-GNSS also typically requires a reference station, adding to cost and setup complexity (Tiozzo Fasiolo et al., 2023).
Radar: Radar offers advantages as a robotic perception modality in adverse weather conditions, such as dust, fog, rain, and snow, where LiDAR and cameras may perform poorly. Radars also offer better penetration in vegetation and can detect occluded targets. However, they typically build 2D images, limiting their direct use for 3D map reconstruction (Tiozzo Fasiolo et al., 2023).
Given the inherent limitations of individual sensors in dynamic and complex agricultural environments, multi-sensor fusion is critical for achieving robust SLAM performance. Modern agricultural SLAM heavily leverages advanced data fusion techniques, with a strong emphasis on integrating semantic information. In conclusion, robust SLAM in agriculture demands a combination of LiDAR, visual, IMU, and RTK-GNSS sensors, particularly in challenging environments. These modalities are most effectively integrated using tightly coupled data fusion techniques like factor graphs, critically enhanced by deep learning-based semantic segmentation and object detection to intelligently handle dynamic elements and extract meaningful, robust landmarks within complex agricultural scenes. Future research will likely focus on developing lighter deep learning architectures, improving real-time performance, and further integrating geometric and semantic information to enhance the system’s ability to discern the motion state of targets and adapt to varying scales and conditions.

6.2. Intelligent industry and warehousing

In the logistics and warehousing domain, semantic SLAM has been successfully integrated into autonomous aerial inventory systems, enabling fast and reliable stock management without human intervention. Beul et al. (2018) developed a micro aerial vehicle (MAV) that performs fully autonomous stocktaking inside large warehouses, relying on a 3D LiDAR-based SLAM pipeline for localization and mapping, while integrating RFID readers and fiducial markers for item recognition. The system was evaluated in an operational warehouse, where it navigated narrow aisles, avoided static and dynamic obstacles such as forklifts, and maintained accurate pose estimation in highly self-similar environments. Experimental results showed robust navigation at velocities up to 2.1 m/s with a mean waypoint deviation of less than 10 cm, demonstrating industrial-grade accuracy and efficiency. This application highlights the practical relevance of semantic SLAM in Industry 4.0, where inventory automation reduces human workload, minimizes errors, and supports continuous real-time stock management. Importantly, it also illustrates how algorithmic advances in SLAM directly translate into measurable performance gains in intelligent industrial systems.

6.3. Autonomous driving

SLAM has become an indispensable technology in autonomous driving, enabling vehicles to achieve centimeter-level localization and robust perception in complex and dynamic road environments. In practice, autonomous vehicles must operate across diverse scenarios—highways, dense urban areas, tunnels, and adverse weather—where traditional GNSS or odometry-based localization methods often fail (Wang, Guo, Chen and Lu, 2025). SLAM systems leverage multi-sensor configurations, including LiDAR, cameras, and IMUs, to construct high-precision maps while simultaneously localizing the vehicle in real time. Semantic SLAM further augments this capability by integrating object-level understanding, allowing vehicles to detect and track dynamic agents such as cars and pedestrians, and to interpret traffic signs and lane markings as semantic landmarks. Case studies demonstrate that approaches like DynaSLAM and SG-SLAM significantly improve trajectory accuracy and robustness in urban driving scenes by filtering or modeling dynamic objects, while large-scale mapping frameworks now allow for the construction of kilometer-scale maps that can be continuously updated through crowd-sourced data. Despite these advances, deployment challenges remain: computational requirements are high, and maintaining robustness under variable lighting and weather conditions is still difficult. Nonetheless, SLAM-based localization and semantic scene understanding form the backbone of advanced driver assistance systems (ADAS) and are foundational for achieving safe and reliable Level 4+ autonomous driving in real-world intelligent transportation systems (Zheng, Wang, Rizos, Ding, & El-Mowafy, 2023).
By expanding the discussion of applications and deployment, this survey not only summarizes the state of semantic SLAM methods but also demonstrates their practical value across domains central to intelligent systems research and practice.

7. Practical challenges and deployment

While semantic SLAM has demonstrated remarkable progress in academic settings, the transition from laboratory prototypes to real-world deployment introduces a number of practical challenges. These challenges span hardware requirements, robustness under uncontrolled conditions, scalability in long-term operations, and the persistent gap between research systems and commercially viable products. Addressing these factors is critical to enable semantic SLAM to mature into a technology suitable for practitioners across domains such as intelligent industry, transportation, and agriculture.

7.1. Computational requirements

Semantic SLAM systems often combine geometric estimation with deep neural networks for object detection and segmentation, resulting in significant computational demand. Many approaches rely on GPU acceleration to achieve real-time performance, especially in dynamic environments where every frame must be processed at high frequency (Galagain, Poreba, & Goulette, 2025). On resource-constrained platforms, this creates a trade-off between accuracy and efficiency. While lightweight architectures such as MobileNet or Tiny-YOLO can partially mitigate these issues, they often sacrifice semantic richness. Designing architectures that balance accuracy, speed, and power consumption remains an open challenge, particularly for embedded and mobile robotic platforms.

7.2. Robustness in real-world environments

In real-world deployments, semantic SLAM must cope with factors rarely encountered in controlled laboratory experiments. These include dynamic objects such as humans, forklifts, or animals, changing illumination, sensor noise, and environmental variability. Esparza and Flores (2021) demonstrate that even state-of-the-art systems struggle when large moving objects dominate the field of view, while Han, Li, Wang, and Zhou (2021b) highlight the difficulties of maintaining semantic consistency in cluttered indoor scenes. Robustness therefore requires not only accurate semantic segmentation but also reliable filtering of outliers and adaptive fusion strategies that can adjust to environmental changes without degrading localization accuracy.

7.3. Scalability and long-term mapping

Scaling semantic SLAM to large and long-term deployments introduces further difficulties. Semantic maps must be continuously updated to remain relevant, yet doing so without introducing drift, redundancy, or inconsistency is non-trivial. For example, autonomous vehicles operating in urban environments require kilometer-scale semantic maps that must adapt to changing infrastructure, traffic patterns, and seasonal variations. Efficient map management, including storage, querying, and updating mechanisms, is thus essential for practical scalability (Galagain et al., 2025). Recent work has explored semantic graphs and incremental updates as solutions, but reliable large-scale deployment is still an active research challenge.

7.4. From research to commercial products

Finally, there exists a significant gap between academic prototypes and industrial-grade solutions. Academic research often evaluates systems on curated datasets under controlled conditions, whereas real deployment demands robustness, reliability, and maintainability. A recent review of autonomous forklifts (Fraifer et al., 2025) underscores that while many SLAM-based prototypes achieve promising results in warehouses, deployment in production settings faces hurdles related to safety certification, integration with enterprise systems, and cost of hardware. Bridging this gap requires not only algorithmic innovation but also attention to system engineering, reliability, and human–robot interaction in real industrial workflows.
In summary, although semantic SLAM has advanced significantly in terms of algorithms and benchmarks, addressing the challenges of computational efficiency, robustness, scalability, and deployment readiness remains key to translating research into practical impact. Overcoming these challenges will determine the technology’s adoption across intelligent cities, industry, agriculture, and other domains.

8. Performance metrics used in semantic SLAM

This section briefly reviews the evaluation metrics and comparison methods used in both qualitative and quantitative assessments of SLAM and semantic SLAM algorithms. Typically, SLAM systems are evaluated based on accuracy, focusing on positional error, rotational error, and computation time. We classify evaluation metrics into semantic mapping, geometric SLAM, and tracking metrics, corresponding to the core functional components of semantic SLAM systems. Semantic mapping metrics evaluate how well the system understands and labels the environment, geometric SLAM metrics measure the accuracy of pose estimation and map reconstruction, and tracking metrics assess performance in following dynamic objects over time. This structure helps clearly separate the different aspects of system performance and aligns with how these metrics are commonly used in the literature. A detailed sub-classification of these metrics is illustrated in Fig. 19, and selected metrics from each category are explained as follows. Metrics such as Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) directly reflect the accuracy of localization and mapping, which form the foundation for reliable semantic interpretation of the environment. Similarly, metrics related to semantic mapping and object recognition contribute to evaluating how well SLAM systems capture and represent the meaningful structure of a scene. By linking geometric accuracy with semantic consistency, these measures provide a more complete assessment of a system’s capability in advancing scene understanding.

Fig. 19. Performance metrics for evaluation of semantic SLAM in scene understanding.

8.1. Tracking metrics

Accuracy (Karpyshev et al., 2022), precision (Hu, Wu et al., 2025, Liu, Lei et al., 2024, Shoukat et al., 2024, Tardioli et al., 2016, Vasilopoulos et al., 2022), and recall (Wang, Zheng, & Li, 2023) are key metrics for evaluating the performance of classification models in machine learning. Each metric provides insight into a different aspect of the model’s effectiveness, with its relevance varying depending on the use case. Accuracy represents the ratio of correct predictions to the total number of predictions, whereas precision measures the proportion of true positives among all positive predictions, typically expressed as a percentage. In contrast, recall, also known as sensitivity, determines how often the model correctly identifies the true class within the dataset. The balance between precision and recall is especially important in visual loop closure detection. Semantic-aware models must minimize both false positives (requiring high precision) and false negatives (requiring high recall) to maintain map consistency in dynamic environments (Osman, Darwish, & Bayoumi, 2023). These metrics can be calculated using Eqs. (22) to (24): (22) $\text{Accuracy}=\frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$ (23) $\text{Precision}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Positives}}$ (24) $\text{Recall}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}$
F1 score measures the harmonic mean of precision and recall, as shown in Eq. (25) (Karpyshev et al., 2022): (25) $F_{1}=\frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$
Additionally, Average Precision (AP) is a key performance indicator of detection metrics that reflects the trade-off between precision and recall, with values ranging between 0 and 1. This metric is calculated using Eq. (26) (Xing et al., 2022): (26) $AP=\int_{0}^{1}P(R)\,dR$ where $P(R)$ represents precision as a function of recall $R$. MOTA is an ideal performance metric for tracking multiple objects, features, or landmarks over time, and it is defined by Eq. (27) (Chen et al., 2019, Li, Wang et al., 2018, Sahili et al., 2023): (27) $\text{MOTA}=1-\frac{\sum_{t}\left(FN_{t}+FP_{t}+IDS_{t}\right)}{\sum_{t}GT_{t}}$ where $FN_t$ represents the number of false negatives, $FP_t$ the number of false positives, $IDS_t$ the number of identity switches at time $t$, and $GT_t$ the number of ground-truth objects.
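As a concrete illustration, the tracking metrics above can be computed from raw confusion counts in a few lines of Python (a minimal sketch; the function names are illustrative, not taken from a specific SLAM toolkit):

```python
import numpy as np

def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 score from confusion counts,
    following Eqs. (22)-(25)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

def mota(fn_t, fp_t, ids_t, gt_t):
    """Multiple Object Tracking Accuracy, Eq. (27). Each argument is a
    per-frame sequence of counts: false negatives, false positives,
    identity switches, and ground-truth objects."""
    fn_t, fp_t, ids_t, gt_t = map(np.asarray, (fn_t, fp_t, ids_t, gt_t))
    return 1.0 - (fn_t + fp_t + ids_t).sum() / gt_t.sum()
```

Note that MOTA can become negative when the summed errors exceed the total number of ground-truth objects, which occasionally occurs in highly dynamic scenes.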

8.2. Semantic mapping metrics

Semantic metrics evaluate the system’s ability not only to map the environment and localize within it but also to understand and categorize the elements present. These metrics assess the performance of the system in integrating semantic information, such as identifying objects and their relationships, alongside traditional spatial mapping. Commonly used metrics include Intersection over Union (IoU) (He, Li, Wang, & Wang, 2023) and pixel accuracy (PA) (Han et al., 2021a). IoU is calculated based on the overlap between ground truth and predicted bounding boxes, as shown in Eq. (28): (28) $IoU_{c}=\frac{\text{Area of Overlap}}{\text{Area of Union}}$
Pixel accuracy (PA) is a common metric used to evaluate the performance of semantic segmentation, defined as the ratio of correctly classified pixels to the total pixels in the image, as shown in (29): (29) $PA=\frac{\sum_{j=1}^{k}n_{jj}}{\sum_{j=1}^{k}t_{j}}$ where $n_{jj}$ represents the total number of pixels that are both classified and labeled as class $j$, essentially corresponding to the number of true positives for class $j$, and $t_j$ refers to the total number of pixels labeled as class $j$.
Mean Intersection over Union (mIoU) is an extension of IoU, used for evaluating multiple classes or segments (Liu, Sun and Liu, 2021), whereas Distance Intersection over Union (DIoU) incorporates the distance between the centroids of predicted and ground truth regions (Jiang, Guo, Jiang, Hu, & Zhu, 2021), with the formulas given in Eqs. (30), (31). Additional metrics used to evaluate semantic SLAM include the t-test (Trejos, Rincón, Bolaños, Fallas, & Marín, 2022) and the Non-Parametric (NP) test (Wilcoxon, 1992). The t-test quantitatively assesses SLAM system performance by comparing the estimated trajectory with the ground truth, as shown in Eq. (32), whereas the NP test processes red and green point clouds and applies the Wilcoxon Rank-Sum test to verify the null hypothesis. (30) $mIoU=\frac{1}{c}\sum_{c}IoU_{c}$ where $c$ is the total number of classes, and $IoU_c$ represents the intersection over union for a specific class $c$. (31) $DIoU=1-IoU_{c}+\frac{\rho^{2}\left(b,b_{gt}\right)}{l^{2}}$ where $\rho(b,b_{gt})$ represents the Euclidean distance between the central points of the predicted bounding box $b$ and the ground truth bounding box $b_{gt}$, and $l$ is the diagonal length of the smallest enclosing box that covers both bounding boxes. (32) $t=\frac{\bar{X}-\mu}{s/\sqrt{n}}$ where $\bar{X}$ represents the sample mean, $\mu$ is the population mean, $s$ indicates the standard deviation, and $n$ denotes the sample size.
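The segmentation metrics in Eqs. (28)–(30) are typically computed from a class confusion matrix accumulated over the predicted and ground-truth label maps. A minimal NumPy sketch (helper names are illustrative):

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """k x k confusion matrix from flattened label maps
    (rows: ground truth, columns: prediction)."""
    idx = gt.astype(int) * num_classes + pred.astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    """Eq. (29): correctly classified pixels over all labeled pixels."""
    return np.diag(cm).sum() / cm.sum()

def mean_iou(cm):
    """Eqs. (28) and (30): per-class IoU (true positives over the union of
    prediction and ground truth), averaged over all classes."""
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    return np.mean(tp / union)
```

The same confusion matrix also yields per-class precision and recall, which is convenient when a system reports both tracking and semantic mapping metrics.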

8.3. Geometric SLAM metrics

Geometric SLAM metrics are crucial for evaluating both performance and reliability. They assess accuracy in positioning and mapping, robustness in handling diverse and dynamic environments, and consistency in trajectory and map generation. These metrics also measure computational efficiency, scalability to large environments, real-time performance, and user-centric factors like ease of integration and usability. Common error metrics include RMSE (Han and Xi, 2020, Zhu et al., 2024), ATE (Peng et al., 2025, Zhang, Wang et al., 2022), RPE (Guan et al., 2020, Peng, Ran et al., 2024), Mean (Zhang & Li, 2023), Median (Chen et al., 2020, Lin, Su et al., 2025), and Standard Deviation (Han & Xi, 2020), all of which quantify the differences between estimated and actual positions and trajectories. These are commonly expressed in meters (m), centimeters (cm), or millimeters (mm), based on the camera trajectory measured in different use cases. For rotational RPE, the error is generally measured in degrees, percentages, or degrees per 100 m (Qin et al., 2024). By using these metrics, developers can enhance the reliability and efficiency of semantic SLAM systems for various applications.

8.3.1. Absolute Trajectory Error (ATE)

Absolute Trajectory Error (ATE) is a crucial metric that calculates the error between the estimated camera trajectory and the ground truth. This metric becomes particularly important when evaluating SLAM systems in dynamic environments, where methods like SA-LOAM (Li, Kong et al., 2021) and Pseudo-Anchors (Yang, He, Zhuang, Wang and Yang, 2023) demonstrate improved trajectory accuracy through semantic feature integration (Deng et al., 2019, Huang et al., 2025, Zhao et al., 2019): (33) $ATE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left\|p_{i}^{\text{estimated}}-p_{i}^{\text{ground truth}}\right\|^{2}}$ where $n$ represents the number of data points, and $\left\|p_{i}^{\text{estimated}}-p_{i}^{\text{ground truth}}\right\|^{2}$ is the squared Euclidean norm between the estimated and ground truth positions for the $i$th sample.
A lower ATE indicates more accurate localization and mapping, helping developers identify and improve discrepancies in their semantic SLAM algorithms.
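As a minimal sketch, Eq. (33) can be evaluated with NumPy, assuming the two trajectories are already time-associated and rigidly aligned (e.g. via a Horn or Umeyama fit):

```python
import numpy as np

def ate_rmse(p_est, p_gt):
    """Absolute Trajectory Error, Eq. (33): RMSE of the Euclidean distances
    between estimated and ground-truth positions (n x 3 arrays)."""
    err = np.linalg.norm(np.asarray(p_est) - np.asarray(p_gt), axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```

In practice, ready-made evaluation tools such as the evo package are widely used to compute ATE and RPE directly from trajectory files, including the alignment step.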

8.3.2. Relative Pose Error (RPE)

The Relative Pose Error (RPE) is a key metric in evaluating the performance of visual SLAM by measuring the accuracy of relative motion, or pose change, between consecutive frames or time steps (Cheng et al., 2023). It helps in assessing how well the semantic SLAM system tracks the incremental movement of the camera or sensor. The translational and rotational components of the RPE ($RPE_t$ and $RPE_r$) are shown in (34), (35): (34) $RPE_{t}=\sqrt{\frac{1}{n}\sum_{k=1}^{n}\left\|t_{k}^{\text{estimated}}-t_{k}^{\text{ground truth}}\right\|^{2}}$ (35) $RPE_{r}=\sqrt{\frac{1}{n}\sum_{k=1}^{n}\left\|\log\left(q_{k}^{\text{estimated}}\left(q_{k}^{\text{ground truth}}\right)^{-1}\right)\right\|^{2}}$ where $n$ represents the number of data points, $\left\|t_{k}^{\text{estimated}}-t_{k}^{\text{ground truth}}\right\|^{2}$ is the squared Euclidean norm of the predicted and ground truth translation vectors, and $\log\left(q_{k}^{\text{estimated}}\left(q_{k}^{\text{ground truth}}\right)^{-1}\right)$ is the logarithm of the relative rotation between the predicted and ground truth quaternions for the $k$th sample.
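A sketch of Eqs. (34) and (35) using 4x4 homogeneous poses; here the rotational error is recovered as the angle of the residual rotation, which is equivalent to the norm of its log map:

```python
import numpy as np

def rpe(T_est, T_gt, delta=1):
    """Relative Pose Error, Eqs. (34)-(35). T_est and T_gt are sequences of
    4x4 homogeneous camera poses; `delta` is the frame interval."""
    t_err, r_err = [], []
    for k in range(len(T_est) - delta):
        # relative motion over the interval, for each trajectory
        d_est = np.linalg.inv(T_est[k]) @ T_est[k + delta]
        d_gt = np.linalg.inv(T_gt[k]) @ T_gt[k + delta]
        e = np.linalg.inv(d_gt) @ d_est              # residual transform
        t_err.append(np.linalg.norm(e[:3, 3]))       # translational error
        # rotation angle from the trace of the residual rotation matrix
        cos_a = np.clip((np.trace(e[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        r_err.append(np.degrees(np.arccos(cos_a)))
    return (float(np.sqrt(np.mean(np.square(t_err)))),
            float(np.sqrt(np.mean(np.square(r_err)))))
```

Varying `delta` (e.g. one frame versus one second of frames) distinguishes short-term drift from longer-horizon drift, a common practice on the TUM and KITTI benchmarks.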

8.3.3. Root Mean Square Error (RMSE)

The Root Mean Square Error (RMSE) is a commonly used metric for quantifying the average magnitude of errors between estimated and true values, as shown in (36) (Liu and Miura, 2021a, Zhou, Tao et al., 2023): (36) $RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}e_{i}^{2}}$ where $e_i$ represents the error between the estimated and ground truth values for the $i$th sample, and $n$ is the total number of data points.

8.3.4. Statistical measures

In semantic SLAM systems, apart from other metrics, statistical measures such as mean, median, and standard deviation are also vital for evaluating system performance and reliability (Ahmed et al., 2023, Wu, Guo et al., 2022). The mean, or average, quantifies the central tendency of a dataset, making it key to assessing the overall performance of SLAM systems. The median, the middle value of an ordered dataset, provides a measure of central tendency that is less affected by outliers compared to the mean. Standard deviation indicates the amount of variation or dispersion within a dataset, showing how far the values deviate from the mean. By using these statistical metrics, developers and researchers can gain deeper insights into the performance, robustness, and reliability of SLAM systems, thereby facilitating better optimization and enhancement of algorithms and implementations (Wu, Zhao et al., 2022). The corresponding formulas are shown in (37), (38): (37) $\text{Mean Error}=\frac{1}{n}\sum_{i=1}^{n}\left\|p_{i}^{\text{estimated}}-p_{i}^{\text{ground truth}}\right\|$ (38) $\sigma=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\mu\right)^{2}}$ where $n$ represents the number of data points, $x_i$ is the $i$th data point, $\left\|p_{i}^{\text{estimated}}-p_{i}^{\text{ground truth}}\right\|$ is the Euclidean norm of the predicted and ground truth positions for the $i$th sample, and $\mu$ is the average value of the dataset.
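These statistics can be obtained directly from the per-frame position errors, for example (a minimal sketch on aligned trajectories):

```python
import numpy as np

def error_statistics(p_est, p_gt):
    """Mean (Eq. (37)), median, and standard deviation (Eq. (38)) of the
    per-frame Euclidean position errors between two aligned trajectories."""
    err = np.linalg.norm(np.asarray(p_est) - np.asarray(p_gt), axis=1)
    return float(err.mean()), float(np.median(err)), float(err.std())
```

Reporting the median alongside the mean is useful on dynamic sequences, where a few tracking failures can inflate the mean without affecting the median.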
In addition to discussing the various performance metrics used in semantic SLAM, a benchmark comparison of these metrics for both indoor and outdoor scenes is provided for the TUM RGBD, Bonn, and KITTI datasets in Table 8, Table 9, Table 10, respectively.
On the indoor TUM RGB-D benchmark (Table 8), semantically informed pipelines achieve centimeter-level trajectory errors. Fan et al. (2022) report the lowest ATE (0.0159 m) and the lowest translational RPE (0.0182 m), with Cheng et al. (2023) close behind (ATE 0.0175 m; RPEt 0.02196 m), while Wu, Guo et al. (2022) is notably higher (ATE 0.0546 m; RPEt 0.0315 m). Rotational accuracy is consistently strong for the top entries (RPEr between 0.56° and 0.74°). Only Qian et al. (2021) report a classification Accuracy (92.19%) and only Wu et al. (2020) report IoU (81.75%), highlighting heterogeneous metric reporting across different works. Overall, these results indicate that semantic integration chiefly benefits geometric accuracy on indoor RGB-D data, whereas inconsistent disclosure of semantic metrics limits strict cross-paper comparison.

Table 8. Benchmark comparison of performance metrics using the TUM RGBD datasets.

| Reference | Accuracy (%) | ATE (m) | RPEt (m) | RPEr (deg) | IoU (%) |
| Fan et al. (2022) | | 0.0159 | 0.0182 | 0.5785 | |
| Wu, Guo et al. (2022) | | 0.0546 | 0.0315 | 0.7417 | |
| Cheng et al. (2023) | | 0.0175 | 0.02196 | 0.5611 | |
| Qian et al. (2021) | 92.19 | 0.0429 | | | |
| Wu et al. (2020) | | | | | 81.75 |
| Bavle, De La Puente, How, and Campoy (2020) | | 0.0365 | | | |
Table 9 highlights the variability of performance on the Bonn dataset, which is characterized by dynamic indoor scenes. He et al. (2023) achieve the lowest ATE alongside moderate translational drift and a notable rotational error (ATE 0.0245 m; RPEt 0.1878 m; RPEr 14.3°). Singh et al. (2022) show competitive results with the lowest reported translational RPE (ATE 0.062 m; RPEt 0.069 m), and Li, Guo et al. (2025) also perform strongly (ATE 0.029 m). In contrast, Jiang et al. (2024) and Wu, Guo et al. (2022) present higher ATE values (0.123 m and 0.089 m, respectively), reflecting less robustness to challenging dynamics, while Cheng et al. (2023) falls in the mid-range (ATE 0.0644 m). Overall, these results demonstrate that while several methods achieve sub-decimeter accuracy, robustness to dynamic variations remains inconsistent across approaches, underlining the need for more standardized reporting of translational and rotational errors.

Table 9. Benchmark comparison of performance metrics using the Bonn datasets.

| Reference | ATE (m) | RPEt (m) | RPEr (deg) |
| He et al. (2023) | 0.0245 | 0.1878 | 14.2961 |
| Singh et al. (2022) | 0.0620 | 0.0690 | |
| Jiang, Xu, Li, Feng, and Zhang (2024) | 0.1230 | | |
| Wu, Guo et al. (2022) | 0.0890 | | |
| Cheng et al. (2023) | 0.0644 | | |
| Li, Guo et al. (2025) | 0.0290 | | |
Table 10 summarizes performance on the KITTI dataset, which is widely used for outdoor driving scenarios with high dynamic complexity. Qin et al. (2024) achieve strong relative pose performance, reporting the lowest translational and rotational RPE values despite a higher ATE (RPEt 0.0072 m; RPEr 0.002°; ATE 2.31 m). Lv et al. (2024a) show the largest ATE and translational RPE (ATE 3.53 m; RPEt 1.807 m), indicating limited robustness in large-scale outdoor conditions. By contrast, Esparza and Flores (2022) and Wang, Li et al. (2020) achieve more balanced results (ATE 1.45 m and 1.33 m, respectively), with Esparza and Flores demonstrating a particularly low translational RPE (0.0233 m). Chen, Liu et al. (2022) report the only accuracy percentage but also the highest ATE (Accuracy 80.82%; ATE 4.61 m), suggesting a trade-off between recognition accuracy and localization drift, while Singh et al. (2022) focus on rotational performance (RPEr 0.87°). Overall, these results emphasize that while certain methods excel in pose accuracy, achieving consistently low ATE across challenging outdoor environments remains difficult, underscoring the inherent complexity of large-scale dynamic driving datasets.

Table 10. Benchmark comparison of performance metrics using the KITTI datasets.

| Reference | Accuracy (%) | ATE (m) | RPEt (m) | RPEr (deg) |
| Qin et al. (2024) | | 2.3136 | 0.0072 | 0.0020 |
| Lv et al. (2024a) | | 3.5343 | 1.8074 | |
| Wang, Li, Shen and Cai (2020) | | 1.3267 | 0.4815 | |
| Esparza and Flores (2022) | | 1.4493 | 0.0233 | |
| Chen, Liu et al. (2022) | 80.82 | 4.606 | | |
| Singh et al. (2022) | | | | 0.87 |

8.4. Replication of results from open-source papers

Replication of results from open-source papers plays a crucial role in validating the reliability and generalizability of semantic SLAM methods, as it allows researchers to benchmark existing approaches under consistent conditions. We evaluated selected semantic SLAM algorithms on the TUM and KITTI datasets. The benchmark metrics, consistent with those described earlier, were selected to demonstrate algorithm performance across diverse indoor and outdoor environments, including static and dynamic settings.

8.4.1. System specifications

The experiments were conducted on a high-performance computer with the following specifications:
  • Processor: AMD Ryzen 9 3950X 16-Core Processor with 32 threads, operating at a base clock speed of 2.2 GHz and a maximum clock speed of 3.5 GHz.
  • GPU: NVIDIA GeForce RTX 2080 Ti with 11 GB of VRAM, supporting CUDA version 12.1.
  • Operating System: Ubuntu 16.04.
  • Robotics Framework: ROS Melodic.
We employed widely-used datasets to ensure reproducible results and comprehensive evaluation:
  • TUM RGB-D dataset
  • KITTI dataset
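TUM RGB-D sequences ship RGB and depth frames with independent timestamps, so pipelines typically associate them by nearest timestamp within a tolerance before evaluation, as the dataset's associate.py tool does. A minimal sketch of that matching step follows; the timestamps and the 20 ms tolerance are illustrative only.

```python
def associate(rgb_stamps, depth_stamps, max_dt=0.02):
    """Greedily match RGB and depth timestamps (in seconds) within max_dt,
    smallest time difference first, mimicking TUM's associate.py.
    Returns a sorted list of (t_rgb, t_depth) pairs."""
    candidates = sorted(
        (abs(tr - td), tr, td)
        for tr in rgb_stamps for td in depth_stamps
        if abs(tr - td) <= max_dt)
    pairs, used_r, used_d = [], set(), set()
    for dt, tr, td in candidates:
        if tr not in used_r and td not in used_d:
            pairs.append((tr, td))
            used_r.add(tr)
            used_d.add(td)
    return sorted(pairs)

rgb = [0.00, 0.033, 0.066, 0.100]
depth = [0.002, 0.036, 0.070, 0.131]
print(associate(rgb, depth))  # the last RGB frame has no depth match within 20 ms
```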

8.4.2. Results

We present evaluation results for various semantic SLAM systems tested on both indoor and outdoor datasets. First, the results for the TUM RGB-D indoor dataset are presented, followed by the KITTI outdoor dataset, which further highlights the advantages of semantic SLAM in handling challenging outdoor scenarios.
  • TUM dataset: The following evaluation results were obtained using the TUM RGB-D indoor dataset, focusing on the sequences fr3_walking_xyz, fr3_walking_static, fr3_walking_rpy, and fr3_walking_half, as presented in Tables 11, 12, and 13 (one table per metric). Key performance metrics, namely ATE (reported as RMSE), RPEt, and RPEr, are used to highlight the effectiveness of integrating semantic information in enhancing SLAM accuracy in dynamic environments. The integration of semantic information in dynamic SLAM systems significantly enhances their performance compared to traditional methods like ORB-SLAM2. This improvement is evident in the superior results across various metrics and sequences from the TUM dataset. For instance, SG-SLAM achieves an ATE of 0.019 m in the fr3_walking_xyz sequence, vastly outperforming ORB-SLAM2’s 0.693 m. Additionally, SG-SLAM’s RPEt for the same sequence is 0.022 m, significantly better than ORB-SLAM2’s 0.475 m.
  • KITTI dataset: The KITTI dataset, particularly in outdoor environments, further highlights the benefits of semantic SLAM. For instance, VDO-SLAM with Mask R-CNN records an ATE of 1.2 m in the KITTI 00 sequence, compared to ORB-SLAM2’s 1.3 m. Its RPEt of 0.06 m in the same sequence is, however, slightly higher than ORB-SLAM2’s 0.04 m, showing that semantic integration does not improve every metric uniformly. Overall, these results emphasize the importance of semantic data integration, which enhances scene understanding by allowing the system to distinguish between different objects and dynamic elements. Tables 14, 15, and 16 present a comprehensive evaluation of different SLAM systems integrated with semantic information, including Dyna-SLAM and VDO-SLAM, both of which use Mask R-CNN for semantic integration, compared to the traditional ORB-SLAM2 system, which does not employ semantic data. The evaluation is based on KITTI outdoor dataset sequences, focusing on ATE, RPEt, and RPEr.

Table 11. Evaluation of semantic SLAM systems on TUM datasets using ATE performance metric.

Values are reported as Original / Ours.

| Sequence | SG-SLAM (NCNN) (Cheng et al., 2023) | Dyna-SLAM (Mask R-CNN) (Bescos et al., 2018) | DS-SLAM (SegNet) (Yu et al., 2018a) | YOLO-SLAM (darknet19-yolov3) (Wu, Guo et al., 2022) | RDS-SLAM (Mask R-CNN) (Liu & Miura, 2021b) | RDS-SLAM (SegNet) (Liu & Miura, 2021b) |
| --- | --- | --- | --- | --- | --- | --- |
| fr3_walking_xyz | 0.0152 / 0.019 | 0.015 / 0.016 | 0.024 / 0.023 | 0.014 / 0.013 | 0.021 / 0.021 | 0.057 / 0.056 |
| fr3_walking_static | 0.007 / 0.008 | 0.007 / 0.006 | 0.008 / 0.0078 | 0.007 / 0.006 | 0.081 / 0.078 | 0.02 / 0.02 |
| fr3_walking_rpy | 0.032 / 0.034 | 0.136 / 0.135 | 0.443 / 0.444 | 0.216 / 0.223 | 0.146 / 0.145 | 0.16 / 0.159 |
| fr3_walking_half | 0.026 / 0.023 | 0.029 / 0.029 | 0.03 / 0.03 | 0.028 / 0.028 | 0.025 / 0.030 | 0.08 / 0.08 |

Table 12. Evaluation of semantic SLAM systems on TUM dataset using RPEt performance metric.

Values are reported as Original / Ours.

| Sequence | SG-SLAM (NCNN) (Cheng et al., 2023) | Dyna-SLAM (Mask R-CNN) (Bescos et al., 2018) | DS-SLAM (SegNet) (Yu et al., 2018a) | YOLO-SLAM (darknet19-yolov3) (Wu, Guo et al., 2022) | RDS-SLAM (Mask R-CNN) (Liu & Miura, 2021b) | RDS-SLAM (SegNet) (Liu & Miura, 2021b) |
| --- | --- | --- | --- | --- | --- | --- |
| fr3_walking_xyz | 0.0194 / 0.022 | 0.021 / 0.022 | 0.033 / 0.033 | 0.019 / 0.019 | 0.028 / 0.028 | 0.042 / 0.043 |
| fr3_walking_static | 0.010 / 0.013 | 0.008 / 0.009 | 0.0102 / 0.011 | 0.009 / 0.0087 | 0.041 / 0.042 | 0.022 / 0.022 |
| fr3_walking_rpy | 0.045 / 0.074 | 0.044 / 0.045 | 0.150 / 0.15 | 0.093 / 0.092 | 0.111 / 0.111 | 0.132 / 0.132 |
| fr3_walking_half | 0.027 / 0.032 | 0.028 / 0.028 | 0.029 / 0.029 | 0.026 / 0.027 | 0.028 / 0.027 | 0.048 / 0.051 |

Table 13. Evaluation of semantic SLAM systems on TUM dataset using RPEr performance metric.

Values are reported as Original / Ours.

| Sequence | SG-SLAM (NCNN) (Cheng et al., 2023) | Dyna-SLAM (Mask R-CNN) (Bescos et al., 2018) | DS-SLAM (SegNet) (Yu et al., 2018a) | YOLO-SLAM (darknet19-yolov3) (Wu, Guo et al., 2022) | RDS-SLAM (Mask R-CNN) (Liu & Miura, 2021b) | RDS-SLAM (SegNet) (Liu & Miura, 2021b) |
| --- | --- | --- | --- | --- | --- | --- |
| fr3_walking_xyz | 0.504 / 0.504 | 0.628 / 0.627 | 0.826 / 0.834 | 0.598 / 0.588 | 0.723 / 0.028 | 0.922 / 0.919 |
| fr3_walking_static | 0.267 / 0.270 | 0.261 / 0.271 | 0.269 / 0.269 | 0.262 / 0.342 | 1.168 / 0.042 | 0.494 / 0.540 |
| fr3_walking_rpy | 0.956 / 0.957 | 0.989 / 1.002 | 3.012 / 3.000 | 1.823 / 1.823 | 9.319 / 0.111 | 13.170 / 13.210 |
| fr3_walking_half | 0.811 / 0.812 | 0.784 / 0.776 | 0.814 / 0.812 | 0.753 / 0.752 | 0.821 / 0.027 | 1.876 / 1.874 |

Table 14. Evaluation of semantic SLAM systems on KITTI dataset using ATE performance metric.

Values are reported as Original / Ours.

| Seq | Dyna-SLAM (Mask R-CNN) (Bescos et al., 2018) | VDO-SLAM (Mask R-CNN) (Zhang, Henein, Mahony, & Ila, 2020) |
| --- | --- | --- |
| KITTI 00 | 1.4 / 1.2 | 1.2 / 1.2 |
| KITTI 01 | 9.4 / 10.1 | 8.9 / 8.7 |
| KITTI 02 | 6.7 / 7.1 | 5.4 / 5.7 |
| KITTI 03 | 0.6 / 0.6 | 0.6 / 0.6 |
| KITTI 04 | 0.2 / 0.3 | 0.2 / 0.2 |

Table 15. Evaluation of semantic SLAM systems on KITTI dataset using RPEt performance metric.

Values are reported as Original / Ours.

| Seq | Dyna-SLAM (Mask R-CNN) (Bescos et al., 2018) | VDO-SLAM (Mask R-CNN) (Zhang et al., 2020) |
| --- | --- | --- |
| KITTI 00 | 0.04 / 0.03 | 0.067 / 0.072 |
| KITTI 01 | 0.05 / 0.04 | 0.044 / 0.044 |
| KITTI 02 | 0.04 / 0.05 | 0.021 / 0.020 |
| KITTI 03 | 0.06 / 0.04 | 0.03 / 0.04 |
| KITTI 04 | 0.07 / 0.06 | 0.05 / 0.05 |

Table 16. Evaluation of semantic SLAM systems on KITTI datasets using RPEr performance metric.

Values are reported as Original / Ours.

| Seq | Dyna-SLAM (Mask R-CNN) (Bescos et al., 2018) | VDO-SLAM (Mask R-CNN) (Zhang et al., 2020) |
| --- | --- | --- |
| KITTI 00 | 0.06 / 0.05 | 0.07 / 0.072 |
| KITTI 01 | 0.04 / 0.03 | 0.012 / 0.003 |
| KITTI 02 | 0.03 / 0.03 | 0.04 / 0.04 |
| KITTI 03 | 0.04 / 0.04 | 0.08 / 0.078 |
| KITTI 04 | 0.06 / 0.05 | 0.11 / 0.10 |

8.4.3. Benchmarking against ORB-SLAM2

To establish a comprehensive benchmarking framework, all tested systems were evaluated under identical conditions using widely accepted datasets (TUM RGB-D and KITTI) and metrics, including ATE, Translational RPE (RPEt), and Rotational RPE (RPEr). ORB-SLAM2, which lacks semantic integration, was chosen as the baseline system to provide a reference point for evaluating the impact of semantic methods.
The benchmarking results on the TUM RGB-D dataset show significant performance differences between ORB-SLAM2 and semantic SLAM systems. For the sequence fr3_walking_xyz, ORB-SLAM2 achieved an ATE of 0.693 m, whereas SG-SLAM recorded an impressive 0.019 m, reflecting a 97.3% improvement. Similarly, for the fr3_walking_static sequence, SG-SLAM achieved an ATE of 0.008 m compared to ORB-SLAM2’s 0.392 m, highlighting its superior capability in both static and dynamic scenarios. These results are shown in Table 17.
The RPEt results further emphasize the impact of semantic methods. In the fr3_walking_xyz sequence, ORB-SLAM2’s RPE was 0.475 m/frame compared to SG-SLAM’s 0.022 m/frame, demonstrating a significant reduction in translational drift due to semantic integration. This is clearly shown in Table 18.
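Unlike ATE, translational RPE compares relative motions over a fixed frame offset, which makes it insensitive to globally accumulated drift. The numpy sketch below illustrates the computation on 4x4 homogeneous poses; for brevity the toy poses use identity rotations, and the helper names are ours rather than from any surveyed implementation.

```python
import numpy as np

def rpe_translational(gt_poses, est_poses, delta=1):
    """Translational RPE: for each frame pair (i, i+delta), compare the
    estimated relative transform against the ground-truth one and take
    the RMSE of the residual translation magnitudes."""
    errors = []
    for i in range(len(gt_poses) - delta):
        rel_gt = np.linalg.inv(gt_poses[i]) @ gt_poses[i + delta]
        rel_est = np.linalg.inv(est_poses[i]) @ est_poses[i + delta]
        err = np.linalg.inv(rel_gt) @ rel_est      # residual transform
        errors.append(np.linalg.norm(err[:3, 3]))  # translational part
    return float(np.sqrt(np.mean(np.square(errors))))

def pose(t):
    """Identity-rotation pose at translation t (toy helper)."""
    T = np.eye(4)
    T[:3, 3] = t
    return T

gt = [pose([0.1 * i, 0, 0]) for i in range(5)]
est = [pose([0.11 * i, 0, 0]) for i in range(5)]  # 0.01 m extra drift per step
print(round(rpe_translational(gt, est), 4))  # constant 0.01 m/frame residual
```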

Table 17. ATE comparison on TUM RGB-D dataset.

| Sequence | ORB-SLAM2 (m) | SG-SLAM (m) |
| --- | --- | --- |
| fr3_walking_xyz | 0.693 | 0.019 |
| fr3_walking_static | 0.392 | 0.008 |
| fr3_walking_rpy | 1.022 | 0.034 |
ORB-SLAM2 performed moderately well on static sequences in the KITTI dataset but showed considerable limitations in dynamic scenarios. For instance, in the KITTI 01 sequence, which involves high-speed motion on highways, DynaSLAM recorded an ATE of 10.1 m compared to ORB-SLAM2’s 10.4 m, showing a modest improvement. For the KITTI 00 sequence, DynaSLAM achieved a slightly improved ATE of 1.2 m over ORB-SLAM2’s 1.3 m. These results illustrate that semantic integration provides more robust performance in handling dynamic elements, though its impact varies depending on the complexity of the scene. Results are displayed in Table 19.

Table 18. RPEt comparison on TUM RGB-D dataset.

| Sequence | ORB-SLAM2 (m/frame) | SG-SLAM (m/frame) |
| --- | --- | --- |
| fr3_walking_xyz | 0.475 | 0.022 |
| fr3_walking_static | 0.361 | 0.013 |
| fr3_walking_rpy | 0.451 | 0.074 |
A comparative analysis reveals the following key insights:

Table 19. ATE comparison on KITTI dataset.

| Sequence | ORB-SLAM2 (m) | DynaSLAM (m) |
| --- | --- | --- |
| KITTI 00 | 1.3 | 1.2 |
| KITTI 01 | 10.4 | 10.1 |
| KITTI 02 | 5.7 | 7.1 |
  • Accuracy Improvements: Semantic SLAM systems consistently outperformed ORB-SLAM2 across both datasets. SG-SLAM demonstrated a 97.3% improvement in ATE for the fr3_walking_xyz sequence, while DynaSLAM showed marginal gains for challenging KITTI sequences, such as KITTI 01.
  • Dynamic Object Handling: Semantic methods significantly reduced errors in sequences with high dynamic content. For example, SG-SLAM reduced RPEt in the fr3_walking_xyz sequence by over 95% compared to ORB-SLAM2.
  • Efficiency Trade-Offs: Semantic systems, while more accurate, often require higher computational resources. RDS-SLAM, however, provided a balance between accuracy and efficiency, achieving notable reductions in processing time compared to ORB-SLAM2.
These results highlight the transformative potential of semantic integration in SLAM, enabling systems to handle dynamic environments more effectively and improve trajectory accuracy. The benchmarking also highlights the importance of standardized evaluation frameworks to ensure consistent comparisons and drive advancements in the field.

8.4.4. Processing time

Table 20 shows the average processing time per frame for various dynamic SLAM systems, each using different semantic algorithms and tested on the previously specified hardware. The results highlight the computational efficiency and performance variations across the systems, ranging from ORB-SLAM2, which does not use semantic algorithms, to more complex setups like YOLO-SLAM and DynaSLAM. The latter systems demonstrate significantly higher processing times due to their advanced dynamic object detection capabilities.
The evaluation of the tested dynamic SLAM systems highlights the significant influence of semantic methods on performance, particularly in handling dynamic environments. Among the systems evaluated, SG-SLAM (using NCNN) and DynaSLAM (leveraging Mask R-CNN) demonstrated superior performance. SG-SLAM excelled in indoor scenarios, such as those represented by the TUM dataset, where its robust semantic segmentation effectively filtered dynamic elements, achieving a notably low ATE of 0.019 m in the fr3_walking_xyz sequence compared to ORB-SLAM2’s 0.693 m. DynaSLAM, on the other hand, proved highly effective in both indoor and outdoor environments, as demonstrated by its robust results on the KITTI dataset. Its use of Mask R-CNN facilitated accurate detection and exclusion of dynamic objects, yielding reliable mapping and localization even in complex scenes, such as in the KITTI 00 sequence, where it achieved an ATE of 1.4 m.

Table 20. Average processing time per frame (ms).

| System | Average processing time per frame (ms) |
| --- | --- |
| ORB-SLAM2 | 59.26 |
| SG-SLAM (NCNN) | 65.71 |
| YOLO-SLAM (darknet19-yolov3) | 696.09 |
| DS-SLAM (SegNet) | 59.4 |
| DynaSLAM (Mask R-CNN) | 192.00 |
| RDS-SLAM (Mask R-CNN) | 57.5 |
| RDS-SLAM (SegNet) | 57.5 |
Conversely, systems like YOLO-SLAM (darknet19-yolov3) and ORB-SLAM2 underperformed in key areas. YOLO-SLAM, despite its potential for real-time processing, was hindered by high computational demands and less refined semantic segmentation, resulting in higher processing times (696.09 ms per frame) and reduced accuracy in dynamic scenarios. ORB-SLAM2, lacking semantic integration, consistently struggled to distinguish between static and dynamic elements, leading to significant errors in trajectory estimation and mapping in challenging environments.
In specific use cases, certain systems emerged as better suited for particular tasks. For real-time or resource-constrained applications, RDS-SLAM (SegNet) offered a balanced approach, achieving competitive accuracy, such as an ATE of 0.02 m in the fr3_walking_static sequence, while maintaining low processing times (57.5 ms per frame). For highly dynamic scenes, DynaSLAM with Mask R-CNN stood out as the most robust system due to its superior semantic capabilities. These findings underscore the critical role of semantic integration in dynamic SLAM, emphasizing the need for future research to focus on optimizing these methods for real-time applications while maintaining accuracy and robustness.
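Per-frame timing figures such as those in Table 20 reduce to a wall-clock average over a sequence, ideally after discarding a few warm-up frames so model loading and cache effects do not skew the mean. A minimal sketch of such a harness follows; the process_frame stand-in and the frame count are hypothetical.

```python
import time

def average_frame_time_ms(process_frame, frames, warmup=5):
    """Mean wall-clock processing time per frame in milliseconds,
    excluding the first `warmup` frames from the measurement."""
    for f in frames[:warmup]:
        process_frame(f)
    start = time.perf_counter()
    for f in frames[warmup:]:
        process_frame(f)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / max(1, len(frames) - warmup)

# Toy stand-in for a SLAM front end: sleep ~2 ms per "frame".
avg = average_frame_time_ms(lambda f: time.sleep(0.002), list(range(30)))
print(f"{avg:.1f} ms/frame")
```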

8.4.5. Challenges in replicating results using custom datasets

The replication of results in semantic SLAM, particularly when applying existing systems to custom datasets, poses significant challenges. While open-source systems provide the foundational tools, their effective deployment often requires overcoming several technical and methodological hurdles, spanning dependency and versioning issues as well as preprocessing and postprocessing requirements.
One of the primary challenges in implementing semantic SLAM systems is the complexity associated with managing software dependencies and version compatibility. These systems typically rely on an integration of robotics frameworks, deep learning libraries, and hardware-specific configurations.
SLAM systems such as DynaSLAM and SG-SLAM are tightly coupled to the Robot Operating System (ROS), which presents compatibility challenges due to its version-specific requirements for operating systems and hardware platforms. Similarly, deep learning models employed for semantic segmentation, like Mask R-CNN and SegNet, require specific versions of libraries such as TensorFlow or PyTorch; inconsistencies in library versions can lead to model incompatibility or suboptimal performance. Additionally, hardware dependencies, including GPU compatibility, CUDA version alignment, and VRAM capacity, pose further obstacles, particularly for computationally intensive pipelines like YOLO-SLAM.
To address these issues, containerization technologies such as Docker are recommended, as they encapsulate all necessary dependencies within isolated environments, ensuring consistent performance across different systems. Moreover, comprehensive documentation detailing software and library version requirements can significantly enhance system reproducibility and compatibility.
Adapting custom datasets for integration with existing semantic SLAM systems often requires extensive preprocessing to meet the specific format and quality standards of these systems. Many SLAM frameworks are optimized for standardized datasets such as TUM or KITTI, necessitating the reformatting of custom datasets to include synchronized RGB-D or stereo image pairs, accurate timestamps, and ground truth pose information to ensure system compatibility. Additionally, real-world datasets commonly contain sensor noise, missing data, and other artifacts that can negatively impact performance. To address these issues, preprocessing techniques such as noise reduction, depth image interpolation, and data smoothing are essential for enhancing data quality and maintaining system accuracy. Furthermore, semantic SLAM systems rely heavily on precise object annotations for effective dynamic object detection and segmentation. Generating high-quality semantic labels for custom datasets is often a time-consuming process that requires a combination of automated annotation tools and manual verification to ensure both accuracy and consistency across the dataset.
After processing a custom dataset with a semantic SLAM system, several challenges emerge in postprocessing and result evaluation. One of the primary difficulties is the calculation of performance metrics, as custom datasets often lack predefined ground truth data necessary for evaluating standard SLAM metrics such as Absolute Trajectory Error (ATE) and Relative Pose Error (RPE). In such cases, it becomes essential to generate high-quality ground truth data using alternative methods, such as motion capture systems or precise manual annotations, to ensure accurate performance assessment. Additionally, maintaining semantic alignment between the system’s outputs and the dataset’s object categories can be complex, particularly when the custom dataset employs a taxonomy that differs from that used by the pretrained models. This misalignment may require additional mapping or reclassification to ensure meaningful comparisons. Furthermore, effective visualization of the results is critical for comparing performance across different systems. Standardized visualization techniques are necessary to clearly illustrate differences in mapping accuracy, localization performance, and semantic segmentation outcomes, thereby facilitating a comprehensive evaluation of the system’s capabilities. By analyzing studies that have made their implementations publicly available, we can assess not only the reproducibility of proposed techniques but also their potential for real-world adoption and further advancements in Semantic SLAM research. Building on these insights, the next section discusses key open challenges and promising directions for future research in the field.
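The taxonomy-alignment step described above can be sketched as a simple label mapping with explicit handling of classes that have no counterpart in the pretrained model's vocabulary. All label names below are hypothetical illustrations, not taken from any particular dataset or model.

```python
# Hypothetical mapping from a custom dataset's taxonomy to the classes a
# pretrained segmentation model emits; None marks classes with no counterpart.
CUSTOM_TO_MODEL = {
    "pedestrian": "person",
    "cyclist": "bicycle",
    "vehicle": "car",
    "pushchair": None,  # excluded from evaluation
}

def remap_labels(custom_labels, mapping):
    """Translate custom-taxonomy labels into the model's taxonomy,
    collecting labels that cannot be aligned for manual review."""
    remapped, unmatched = [], []
    for lbl in custom_labels:
        target = mapping.get(lbl)
        if target:
            remapped.append(target)
        else:
            unmatched.append(lbl)  # unknown or unmappable class
    return remapped, unmatched

labels = ["pedestrian", "vehicle", "pushchair", "tricycle"]
print(remap_labels(labels, CUSTOM_TO_MODEL))
```

Unmatched classes then need either a manual mapping decision or exclusion from the per-class evaluation, so that accuracy figures compare like with like.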

9. Future work directions

Future research in semantic SLAM should prioritize the development of adaptive model architectures capable of dynamic weight adjustment in response to scene variations. As environments transition from static indoor spaces to dynamic outdoor scenarios, semantic segmentation models must intelligently recalibrate their parameters to maintain optimal performance. This adaptability is crucial for handling the diverse conditions encountered in real-world deployments, from lighting changes to varying object densities and movement patterns.
Real-time environmental adaptation represents a critical challenge requiring sophisticated methodologies. Online learning frameworks and domain adaptation techniques offer promising solutions for enabling dynamic recalibration of VSLAM systems. These approaches allow models to continuously update their understanding based on incoming sensory data, ensuring robust performance even in fluctuating and uncertain contexts. Such adaptability is particularly vital for autonomous systems operating in unstructured environments where pre-trained models may encounter previously unseen scenarios.
Temporal coherence of semantic information emerges as another fundamental requirement for robust semantic SLAM. Maintaining consistent semantic labels across consecutive frames not only alleviates positioning inaccuracies caused by discontinuities but also strengthens the reliability of loop closure detection. This semantic consistency over time proves essential for long-term robotic operations, where systems must recognize previously visited locations despite temporal changes. Applications requiring continuous mobility in repetitive tasks particularly benefit from this temporal stability, as it enables more accurate global localization and map consistency.
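One lightweight way to realize the temporal label consistency discussed above is to fuse per-frame predictions for each map landmark by majority vote, with a confidence ratio indicating label stability. This is a deliberate simplification of the probabilistic fusion full systems use; the labels are illustrative.

```python
from collections import Counter

def fuse_labels(observations):
    """Temporally fuse per-frame semantic labels for one landmark by
    majority vote; the second return value is the vote fraction,
    usable as a simple stability score."""
    counts = Counter(observations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(observations)

# A landmark observed in 8 frames; one frame misclassifies it.
obs = ["chair"] * 7 + ["table"]
print(fuse_labels(obs))  # → ('chair', 0.875)
```

A Bayesian update over a label distribution generalizes this sketch and additionally lets the system down-weight low-confidence segmentations.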
Computational efficiency remains a primary concern for real-time semantic SLAM deployment. Future research must focus on developing lightweight semantic segmentation models that balance accuracy with resource constraints. Advanced optimization techniques, including network pruning, knowledge distillation, and quantization, show promise for creating models suitable for edge computing platforms. These approaches enable sophisticated semantic understanding on computationally limited autonomous robots without compromising real-time performance requirements.
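Of the compression techniques mentioned, post-training quantization is the simplest to illustrate. The numpy sketch below applies symmetric per-tensor int8 quantization to a weight matrix and checks the resulting memory and error trade-off; it conveys the idea only and is not a framework-ready pipeline.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map the float range
    [-max|w|, +max|w|] onto [-127, 127] with a single scale factor."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.RandomState(1)
w = rng.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
err = float(np.abs(w - dequantize(q, s)).max())
print(q.nbytes / w.nbytes, err < s)  # 4x smaller; error bounded by one quantization step
```

Per-channel scales, pruning, and distillation layer further savings on top of this, at the cost of more involved calibration.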
Dynamic object handling presents unique challenges that demand specialized solutions. Integrating robust object detectors to identify and exclude moving entities from the mapping process significantly improves pose estimation accuracy. Dynamic environments, particularly those with low texture or repetitive patterns, pose substantial challenges as moving objects reduce the availability of reliable static features. Future systems should employ probabilistic approaches to distinguish between static and dynamic elements, potentially incorporating motion prediction models to anticipate and compensate for environmental changes.
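A common geometric baseline for the static/dynamic separation described above is motion-consistency gating: predict each tracked feature's position from the estimated camera motion and flag tracks whose residual is too large as likely dynamic. The sketch below uses a 2-D homography as a stand-in for full ego-motion; the threshold and point coordinates are illustrative.

```python
import numpy as np

def flag_dynamic(prev_pts, curr_pts, H, thresh=3.0):
    """Flag feature tracks as dynamic when their observed motion deviates
    from the motion predicted by the estimated camera transform H
    (a 3x3 homography on pixel coordinates)."""
    prev_h = np.hstack([prev_pts, np.ones((len(prev_pts), 1))])
    pred = (H @ prev_h.T).T
    pred = pred[:, :2] / pred[:, 2:3]               # back to pixel coordinates
    residual = np.linalg.norm(curr_pts - pred, axis=1)
    return residual > thresh                        # True = likely dynamic

# Static scene shifted 5 px right; the last feature moves independently.
H = np.array([[1, 0, 5], [0, 1, 0], [0, 0, 1]], float)
prev = np.array([[10, 10], [50, 80], [200, 40]], float)
curr = np.array([[15, 10], [55, 80], [230, 60]], float)
print(flag_dynamic(prev, curr, H))
```

In practice the hard threshold is replaced by a chi-square test on the residual under the feature's covariance, and semantic masks (e.g. from Mask R-CNN, as in DynaSLAM) supply a strong prior on which tracks to test.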
The convergence of online learning and domain adaptation algorithms offers a pathway toward truly adaptive semantic SLAM systems. These techniques enable gradual model refinement in response to contextual changes encountered during extended operation periods. Online learning facilitates immediate incorporation of new environmental knowledge, while domain adaptation enables effective knowledge transfer between different operational contexts. This combination ensures that semantic SLAM systems remain accurate and relevant throughout their deployment lifecycle, adapting to seasonal changes, structural modifications, and evolving environmental conditions.
The ultimate objective is achieving seamless integration between semantic understanding and geometric mapping through end-to-end joint optimization. This holistic approach moves beyond treating semantic segmentation as an isolated module, instead fostering deep interconnections between all VSLAM components. Joint optimization frameworks should simultaneously refine semantic predictions, geometric estimates, and data associations, creating systems where each component benefits from and contributes to the others’ performance. Such integration promises more robust and reliable VSLAM systems capable of operating effectively in complex, real-world scenarios while maintaining both geometric accuracy and semantic understanding.

10. Conclusion

This comprehensive review has examined the evolution and current state of semantic SLAM, demonstrating how the integration of semantic understanding with traditional geometric mapping has transformed robotic perception and navigation. Through our analysis of various approaches across different sensor modalities, from monocular to multi-modal systems, we have highlighted both the significant advances achieved and the challenges that remain.
The evaluation of existing semantic SLAM systems reveals consistent improvements in robustness and accuracy when semantic information is properly integrated. The reproducibility studies conducted on benchmark datasets confirm that semantic-enhanced systems outperform traditional geometric SLAM in dynamic environments, though at the cost of increased computational complexity. These findings underscore the importance of balancing semantic richness with real-time performance requirements.
Several key insights emerge from this survey. First, the choice of sensor modality significantly impacts both the quality of semantic understanding and computational efficiency. While RGB-D and LiDAR systems provide rich geometric information, monocular approaches demonstrate surprising effectiveness when combined with advanced deep learning techniques. Second, the handling of dynamic environments remains a critical differentiator among approaches, with recent methods showing promising results through probabilistic modeling and temporal consistency constraints. Third, the gap between laboratory demonstrations and real-world deployment persists, particularly in terms of long-term reliability and computational constraints.
Several fundamental challenges must be addressed to realize the full potential of semantic SLAM. Dynamic model adaptation stands as a primary research direction, requiring systems that can adjust their parameters in response to environmental changes without manual intervention. Temporal coherence of semantic information presents another crucial area, as maintaining consistent semantic understanding over extended periods is essential for reliable long-term operation. Additionally, the development of lightweight yet accurate semantic segmentation models remains vital for deploying these systems on resource-constrained platforms.
The integration of online learning and domain adaptation techniques offers promising avenues for creating truly adaptive semantic SLAM systems. These approaches would enable continuous improvement and adaptation to new environments, moving beyond the current paradigm of fixed, pre-trained models. Furthermore, end-to-end joint optimization of semantic and geometric components represents the ultimate goal, promising systems where perception and mapping are seamlessly integrated rather than loosely coupled.
In conclusion, semantic SLAM has moved from research laboratories to practical implementation in autonomous systems. While significant challenges remain in computational efficiency, dynamic scene handling, and long-term reliability, the rapid progress documented in this survey suggests that robust, real-world semantic SLAM systems are within reach. As the field continues to mature, we anticipate that the convergence of advanced machine learning, efficient computing architectures, and novel algorithmic approaches will enable the next generation of intelligent robotic systems capable of truly understanding and navigating complex, dynamic environments. The future of autonomous navigation lies not just in knowing where things are, but in understanding what they are and how they relate to the robot’s objectives.

CRediT authorship contribution statement

Houssein Kanso: Writing – original draft, Visualization. Abhilasha Singh: Writing – original draft, Investigation, Formal analysis. Etaf El Zarif: Software, Validation, Writing – review & editing, Implementation. Nooruldeen Almohammed: Writing – original draft, Methodology, Investigation. Jinane Mounsef: Conceptualization, Writing – review & editing, Supervision, Project administration. Noel Maalouf: Writing – reviewing & editing, Conceptualization. Bilal Arain: Conceptualization, Writing – review & editing, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

References
